Abstract


In  real  distributed  systems,  processes  may  have  only  inexact  information  about  the 
amount  of  real  time  needed  for  primitive  operations  such  as  process  steps.  This  thesis 
studies  the  effect  of  this  timing  uncertainty  on  the  real-time  behavior  of  distributed  systems. 
We  consider  a  semi-synchronous  model  in  which  the  amount  of  real  time  between  process 
steps is known to be in the interval $[c_1, c_2]$ and every message is known to be delivered within
time $d$ of when it is sent. We use $C = c_2/c_1$ as a measure of the timing uncertainty.

We first study the problem of reaching agreement in the presence of failures. A simple
argument derived from the case of synchronous processes shows that at least time $(f+1)d$
is required to tolerate $f$ failures, while time $(f+1)Cd$ is sufficient to tolerate $f$ stopping or
omission failures by directly simulating the rounds of any synchronous consensus algorithm.
We narrow this gap for omission failures, building on the nearly optimal algorithm of Attiya,
Dwork, Lynch, and Stockmeyer, which tolerates only stopping failures. If fewer than half the
processes are faulty ($n \geq 2f+1$), then the running time of our algorithm is $4(f+1)d + Cd$,
which is within a factor of 4 of optimal and has minimal dependency on the timing
uncertainty factor $C$. If more than half the processes are faulty, then a more complicated
analysis shows the running time is increased by approximately a factor of $\min([\ldots], \sqrt{C})$. We
also present a general simulation for $n \geq 3f+1$ tolerant of Byzantine failures that simulates
any synchronous algorithm at a cost of time $2Cd + d$ per round.

Finally,  motivated  by  the  message  inefficiency  of  our  consensus  algorithm  for  omission 
failures,  we  define  a  more  realistic  model  of  message  links  by  limiting  their  capacity.  If 
messages  are  sent  too  frequently  on  these  message  links,  they  may  incur  delay  greater 
than $d$. For message links with capacity $\mu$, we prove nearly tight upper and lower bounds of
$\min(2Cd + d,\ C^2 d/\mu + Cd + d)$ and $\min(2Cd + d/\mu,\ C^2 d/\mu + Cd + d)$, respectively, for the
time  needed  to  detect  stopping  failures. 
Chapter 1
Introduction 


In real distributed systems, processes are likely to be neither perfectly synchronous nor completely
asynchronous. Many systems lie somewhere between these two extremes and can thus
be more accurately modeled by a semi-synchronous model in which processes have inexact
knowledge about real time. In our model, the degree of asynchrony is captured by a parameter
which we call the processes’ timing uncertainty. We will be particularly interested in how
the magnitude of timing uncertainty affects the time complexity of distributed computing
problems. In particular, we study the time needed to reach consensus in the presence
of omission failures and in the presence of Byzantine failures. We also introduce a model
of message links with bounded capacity and study the time needed to detect failures in a
system using these message links.

In a synchronous system, processors have perfectly synchronized clocks and distributed
algorithms are often broken up into rounds of communication. In a single round of communication,
each processor may receive messages from other processors, perform some local
computation, and then send messages to other processors. The time required to perform
local operations is generally assumed to be negligible and the time complexity of algorithms
is therefore measured by the number of rounds of communication required. In an asynchronous
system, the delay of messages is arbitrary and unbounded (or the relative rates
of different processors are unbounded). The time complexity of an asynchronous algorithm
is usually measured by letting one time unit equal the maximum delay of any message
([Gal82, Awe85]).

The model we use is a slightly simplified version of the semi-synchronous model introduced
in [AL89], which is in turn based on the formal model of timed automata in [MMT90].
In this model, processors have inexact knowledge about the time needed to perform certain
primitive operations. The model is formally described in Section 2.1, but is very simple:
every message is delivered within time $d$ of when it is sent and the amount of time between
any two consecutive steps of any process is in the interval $[c_1, c_2]$. Because process steps are
the only events for which there is a lower bound, a process can deduce a lower bound on
the amount of time for any interval of events only by counting the number of steps it takes
in that interval. For instance, to ensure that time $d$ elapses over an interval of events, a
processor must count $d/c_1$ of its local steps; after these steps it knows that at least time
$c_1 \cdot d/c_1 = d$ (and at most $c_2 \cdot d/c_1$) has elapsed. We will be particularly interested in how
this timing uncertainty factor of $c_2/c_1$, henceforth denoted $C$, affects the time complexity of
problems relative to their synchronous time complexity.
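
The step-counting argument can be illustrated with a small sketch. The function names and the sample values of `c1`, `c2`, and `d` are assumptions chosen for illustration, and the sketch assumes that each counted step accounts for one full inter-step interval.

```python
import math


def steps_to_wait(duration, c1):
    """Fewest local steps whose total time is guaranteed to reach `duration`.

    Each step takes at least c1, so k steps take at least k * c1.
    """
    return math.ceil(duration / c1)


def elapsed_bounds(steps, c1, c2):
    """Least and greatest real time that `steps` local steps can take."""
    return steps * c1, steps * c2


# Illustrative values only: steps take between 1 and 4 time units, d = 100.
c1, c2, d = 1.0, 4.0, 100.0
k = steps_to_wait(d, c1)
earliest, latest = elapsed_bounds(k, c1, c2)
print(k, earliest, latest, latest / earliest)
```

With these values the ratio between the latest and earliest elapsed times equals $c_2/c_1$, which is the factor $C$ that appears in the bounds of later chapters.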

Of  particular  interest  are  problems  that  are  intractable  in  an  asynchronous  setting  yet 
have  solutions  with  tight  bounds  in  the  synchronous  setting.  A  simple  example  is  the  basic 
task  of  detecting  the  failure  of  stopped  processes.  Clearly,  if  there  is  no  bound  on  message 
delay  or  relative  process  step  time,  then  failures  can  never  be  detected  with  certainty;  in  a 
synchronous system, any stopping failure can be detected within approximately the maximum
message delay time. Another natural candidate is the consensus problem. It is well
known that a completely asynchronous algorithm for consensus cannot tolerate the failure
of even one process, whereas exactly $f+1$ rounds of synchronous communication are needed
to tolerate $f$ failures in a synchronous system.


1.1  Reaching  consensus — known  time  bounds 

The  problem  of  reaching  consensus  in  the  presence  of  failures  is  one  of  the  most  well-studied 
problems  in  distributed  computing.  We  consider  the  version  of  this  problem  for  a  system 
of $n$ deterministic processes, some $f$ of which may fail, completely connected by a reliable
message system. The processes begin executing at the same time, each with a private binary
input, and must each decide on a binary value such that no two nonfaulty processes decide
differently and, if all processes begin with value $v$, then $v$ is the decision of all nonfaulty
processes. In this thesis, we consider two kinds of process failure: send-omission failures, by
which  a  process  may  unwittingly  omit  messages  of  an  algorithm,  and  Byzantine  failures,  by 
which  a  process  may  exhibit  arbitrary  behavior. 

It is well known ([FLP85]) that in an asynchronous system, this problem cannot be solved
deterministically even if the only failure to be tolerated is the unannounced halting (stopping)
of a single process. The work of [DDS87] methodically explores the synchrony necessary to
reach consensus; they show that if there is no upper bound on message delay or there is no
upper bound on the relative rate of process steps (that is, if any of our bounds $d$, $c_1$, or $c_2$ does not
hold), then there is no deterministic solution tolerating even a single stopping failure.

The time complexity of the consensus problem has been well studied in the synchronous
rounds model (see, for example, [LSP82, PSL80, FL82, DS83, DLM82]). It is well known
that $f+1$ rounds of communication are both sufficient ([PSL80]) and necessary ([FL82, M85,
DM86, CD86]) to reach consensus, regardless of the severity of failures (stopping, omission,
or Byzantine). In [DLS88], the problem was studied using a model of partial synchrony in
which upper bounds on message delivery time and/or processes’ relative step rates exist,
but they are unknown a priori to the processes. The algorithms of [DLS88] are concerned
with fault tolerance rather than timing efficiency, and therefore translate to relatively slow
algorithms for our model.

For our semi-synchronous model, a lower bound of $(f+1)d$ is implied by the synchronous
lower bound of $f+1$ rounds, via a straightforward transformation of any algorithm for our
model to an algorithm for the synchronous model. For stopping and omission failures,
any synchronous round-based algorithm may be simulated directly, yielding an algorithm
for our model with a running time approximately $C$ times the synchronous running time.
This simulation strategy is described in Section 3.1. Thus, upper bounds of approximately
$(f+1)Cd$ are easily derived. For Byzantine failures, it is not clear how to simulate a
synchronous algorithm correctly.

In  [ADLS90],  Attiya,  Dwork,  Lynch,  and  Stockmeyer  prove  nearly  tight  upper  and  lower 
bounds  on  the  time  to  reach  consensus  in  the  presence  of  stopping  failures.  Surprisingly, 
they give a clever algorithm for consensus that runs in time $2fd + Cd$, much faster than
a direct simulation when $C$ is large. They also show a lower bound of $(f-1)d + Cd$ in a
proof that combines the arguments of the synchronous lower bound with techniques from
asynchronous lower bounds and retiming techniques for our semi-synchronous model.
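
To see how far apart these bounds can be, the following sketch evaluates them for sample parameters. The function names and the values of `f`, `d`, and `C` are illustrative assumptions; the formulas are the approximate bounds quoted in this section.

```python
def lower_bound(f, d, C):
    """Larger of the two known lower bounds, (f+1)d and (f-1)d + Cd."""
    return max((f + 1) * d, (f - 1) * d + C * d)


def direct_simulation(f, d, C):
    """Approximate time to simulate f+1 synchronous rounds directly."""
    return (f + 1) * C * d


def adls_stopping(f, d, C):
    """Running time 2fd + Cd of the [ADLS90] algorithm (stopping failures)."""
    return 2 * f * d + C * d


# Illustrative values: 10 failures, unit message delay, large timing uncertainty.
f, d, C = 10, 1.0, 100.0
print(lower_bound(f, d, C), adls_stopping(f, d, C), direct_simulation(f, d, C))
```

For these values the direct simulation pays the factor $C$ in every round, while the [ADLS90] algorithm pays it only once.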


1.2  Related  work 


Current  research  also  concentrating  on  the  real  time  complexity  of  the  consensus  problem 
appears  in  [SDC90].  There,  processes  are  assumed  to  have  clocks  that  are  synchronized  to 
within  a  fixed  additive  error.  In  contrast  to  our  results,  the  results  of  [SDC90]  are  stated 
in  terms  of  process  clock  time,  not  absolute  time.  The  relationship  between  those  results 
and ours is unclear; a better understanding of the differences between the two models
is  posed  as  a  direction  for  further  research  in  Section  6.2. 

A related model is studied in [HK89] to explore the time complexity of detecting failures
along a network path. This model assumes synchronous processes but differentiates between
the (known) a priori worst-case bound on message delay, $\Delta$, and the (unknown) actual worst-case
message delay in a given execution, $\delta$. Since $\delta$ may be much less than $\Delta$, it is desirable
for algorithms to have minimal dependency on $\Delta$. This model raises a concern similar to
that raised by our model: detecting the absence of a message may be much more costly than
receiving the message. Our algorithms run equally well in this model; we remark on how our
bounds translate to this model in Section 6.1.

Other work in this area includes the extensive literature on clock synchronization algorithms
(see [SWL86] for a survey). Other problems recently studied in our model of timing
uncertainty include the problem of mutual exclusion ([AL89]) and the complexity of a network
synchronizer algorithm ([AM90]).


1.3  Results  of  this  thesis 

1.3.1  Consensus  in  the  presence  of  omission  failures 

In Chapter 3, we strengthen the algorithm of [ADLS90] to tolerate omission failures. The
resulting algorithm has a running time of $4(f+1)d + Cd$ for $n \geq 2f+1$. This is approximately
within a constant factor (4) of the lower bounds of $(f+1)d$ and $(f-1)d + Cd$ ([ADLS90])
and minimizes the dependence on the timing uncertainty $C$.

For $n \leq 2f$, a more involved analysis bounds the running time by two different quantities
simultaneously; one bound depends on the ratio $[\ldots]$ and the other depends on $\sqrt{C}$.
We first derive the bound $(3[\ldots] + 5)(f+1)d + Cd$ using a finer analysis that is similar in
spirit to the analysis for $n \geq 2f+1$. We then show that $(2\sqrt{C} + 6)(f+1)d + Cd$ is also a
bound on the running time using a simple but different argument.

1.3.2  Consensus  in  the  presence  of  Byzantine  failures 

In Chapter 4, we present a simulation algorithm using $3f+1$ processes and tolerating $f$
arbitrary failures. The algorithm simulates any synchronous round-based algorithm tolerant
of $f$ arbitrary failures using roughly time $2Cd + d$ per round.

The  simulation  works  by  keeping  processes  loosely  synchronized  to  ensure  that  a  nonfaulty 
process  does  not  advance  to  round  r  until  it  has  received  a  round  r  —  1  message  from  every 
nonfaulty  process.  The  partial  synchronization  works  by  using  a  combination  of  two  criteria 
for  advancing  to  further  phases,  one  based  on  elapsed  local  time  and  the  other  based  on 
messages  received. 

It follows that any of the known synchronous consensus algorithms tolerating $f$ Byzantine
failures and taking $f+1$ rounds can be run in our model in time $(f+1)(2Cd + d)$.
1.3.3 Timeouts using bounded-capacity message links


In  Chapter  5,  we  define  a  realistic  restriction  on  the  message  links  of  our  model  and  examine 
its  effect  on  the  time  needed  to  detect  stopping  failures.  According  to  the  model  of  [AL89] 
and  [ADLS90]  (used  in  Chapter  3),  every  message  sent  by  a  process  is  delivered  within  time 
d  of  when  it  is  sent,  regardless  of  the  rate  at  which  messages  are  sent.  In  reality,  if  a  link  is 
flooded  with  messages,  their  delay  may  be  much  greater.  Our  algorithm  for  omission  failures 
and  the  algorithm  of  [ADLS90]  ignore  this  consideration  by  requiring  a  process  to  send  a 
message  at  every  step  it  takes.  This  enables  failures  to  be  detected  as  quickly  as  possible, 
but  is  grossly  inefficient  in  its  use  of  messages.  We  therefore  define  a  more  realistic  model  of 
message  delay  that  takes  into  consideration  the  rate  at  which  messages  are  sent. 

We give a clean, modular definition of a message link of arbitrary capacity $\mu$. Such a link
may be thought of as allowing the “progress” of only $\mu$ messages at any time. We then
derive nearly tight bounds on the time needed to detect a stopping failure using such links.
Two easy algorithms guarantee that the time between a failure and its detection is at most
$2Cd + d$ and $C^2 d/\mu + Cd + d$, respectively. We show that these bounds are nearly optimal
by proving a lower bound of the lesser of $2Cd + d/\mu$ and $C^2 d/\mu + Cd + d$.
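
The gap between the upper and lower bounds can be checked numerically. This is only a sketch: the function names and sample values are assumptions, and the formulas are read as $2Cd + d$, $C^2 d/\mu + Cd + d$, and $2Cd + d/\mu$, which is the interpretation used here for the bounds stated above.

```python
def detection_upper_bound(C, d, mu):
    """Better of the two easy algorithms: 2Cd + d or C^2 d/mu + Cd + d."""
    return min(2 * C * d + d, C * C * d / mu + C * d + d)


def detection_lower_bound(C, d, mu):
    """Lesser of 2Cd + d/mu and C^2 d/mu + Cd + d."""
    return min(2 * C * d + d / mu, C * C * d / mu + C * d + d)


# Illustrative values only.
for C, d, mu in [(2.0, 10.0, 4.0), (10.0, 10.0, 2.0)]:
    print(C, mu, detection_lower_bound(C, d, mu), detection_upper_bound(C, d, mu))
```

Comparing the two terms of the upper bound shows that the second algorithm is the faster one exactly when $\mu > C$, since $C^2 d/\mu < Cd$ is equivalent to $C < \mu$.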
Chapter 2

Model  and  Definitions 


Our underlying formal model is essentially the same as that used in [ADLS90]. Our model
differs by assuming for ease of presentation that all messages are delivered in the order sent
and  that  processes  begin  executing  the  algorithm  at  the  same  time.  The  former  assumption 
is  not  used  in  our  algorithm  for  Byzantine  failures  and  is  easily  removed  from  our  algorithm 
for  omission  failures  by  employing  a  more  complicated  protocol  for  receiving  messages.  The 
latter  assumption  is  avoided  in  [ADLS90]  by  instead  providing  a  special  individual  input 
event  for  each  process,  in  which  it  receives  its  initial  value  for  the  consensus  protocol.  In 
measuring the time complexity of the algorithm, time is measured only from the earliest
time that all processes have received an input. Using the same formalism, our algorithm for
omission failures works equally well without the assumption of a synchronized start. This is
not  true,  however,  for  our  algorithm  tolerating  Byzantine  failures,  where  we  make  use  of  the 
fact  that  all  nonfaulty  processes  begin  executing  the  algorithm  at  the  same  time.  Without 
this  assumption,  the  problem  is  complicated  by  the  need  to  determine  when  all  processes 
have  received  inputs.  Also,  in  addition  to  allowing  stronger  failures  than  [ADLS90],  we 
assume that processes know the number of failures, $f$, to be tolerated.


2.1  Formal  model 

We  consider  a  system  of  n  processes  1, . . . ,  n.  Each  process  is  a  deterministic  state  machine 
with  possibly  an  infinite  number  of  states  and  a  distinguished  start  state. 

A configuration is a vector $C$ consisting of the local states of each process. Let $st(i, C)$
denote the state of process $i$ in configuration $C$. We model a computation of the algorithm as
a sequence of configurations alternated with events. Each event $\pi$ is either the computation
step of a single process or the delivery of a message to a process. The local protocol of process
$i$ consists of two transition functions, $M_i$ for message delivery events and $S_i$ for computation
events. Transition function $M_i$ is applied to a state of the process and a message (taken
from some finite message alphabet) and returns a state. (So, for example, a process can
“remember” a message that was delivered to it.) A message delivery event $\pi$ is of the form
$(m, i)$, specifying the message $m$ delivered and the recipient process, $i$. Transition function
$S_i$ is applied to a state of the process and returns a state and a finite set of messages to
be sent.¹ A computation step $\pi$ is of the form $(i, M)$, specifying the process $i$ taking the step
and the set of messages $M$ it sends in that step. ($M$ should be interpreted as the messages
the process actually sends at that step in the execution; if the process is faulty, this may not
correspond to those determined by the transition function.)

An execution is an infinite sequence of alternating configurations and events,
$\alpha = C_0, \pi_1, C_1, \ldots, \pi_j, C_j, \ldots$, where $C_0$ is the vector of start states and each configuration $C_j$ follows
from the previous configuration $C_{j-1}$ and the intervening event $\pi_j$, according to the state
transitions of the process at which event $\pi_j$ occurs. This means that if event $\pi_j$ is an event
at process $i$, then (1) for $y \neq i$, $st(y, C_{j-1}) = st(y, C_j)$; (2) if $\pi_j$ is a message delivery event
specifying the delivery of message $m$, then $st(i, C_j)$ is the result of applying $M_i$ to $st(i, C_{j-1})$ and
$m$; and (3) if $\pi_j$ is a computation event, then $st(i, C_j)$ is the result of applying $S_i$ to $st(i, C_{j-1})$.
Also, each message sent is delivered after it is sent and no unsent “messages” are delivered.

A timed event is a pair $(\pi, t)$, where $\pi$ is an event and $t$, the “time”, is a nonnegative
real number. A timed sequence is an infinite sequence of alternating configurations and
timed events $\alpha = C_0, (\pi_1, t_1), C_1, \ldots, (\pi_j, t_j), C_j, \ldots$, where the times are nondecreasing and
unbounded.

Fix real numbers $c_1$, $c_2$, and $d$, where $0 < c_1 \leq c_2 < \infty$ and $0 < d < \infty$. Letting $\alpha$ be a
timed sequence as above, we say that $\alpha$ is a timed execution if

1. $C_0, \pi_1, C_1, \ldots, \pi_j, C_j, \ldots$ is an execution;

2. The first step of each process is at time 0;

3. There are infinitely many computation steps for each process;

4. If $\pi_i$ and $\pi_j$ are consecutive computation steps of the same process, then $c_1 \leq t_j - t_i \leq c_2$; and

5. If message $m$ is sent to process $i$ during computation event $\pi_j$, then it is delivered to
process $i$ during message delivery event $\pi_k$, $j < k$, such that $0 < t_k - t_j \leq d$.

In our timing analysis (but not in our algorithms or correctness proofs), we make the
assumption that $c_2 \ll d$ and therefore make the approximation $d + c_2 \approx d$.
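
The timing conditions of a timed execution can be checked on a finite prefix with a sketch like the following. The function names and the input representation (a list of step times per process, and send/delivery time pairs per message) are assumptions made for illustration; the condition requiring infinitely many steps cannot be checked on a finite prefix and is ignored, and the bounds $c_1$, $c_2$, and $d$ are treated as inclusive.

```python
def steps_ok(step_times, c1, c2):
    """First step at time 0 and consecutive steps between c1 and c2 apart."""
    if not step_times or step_times[0] != 0:
        return False
    gaps = [later - earlier for earlier, later in zip(step_times, step_times[1:])]
    return all(c1 <= gap <= c2 for gap in gaps)


def delays_ok(messages, d):
    """Every (sent, delivered) pair is delivered no earlier than sent, within d."""
    return all(sent <= delivered <= sent + d for sent, delivered in messages)


# Illustrative prefix: one process stepping at times 0, 1, 3, 4.5 with c1=1, c2=2.
print(steps_ok([0, 1, 3, 4.5], 1, 2))
print(delays_ok([(0, 3), (1, 6)], 5))
```

A prefix that fails either check cannot be extended to a timed execution, while a prefix that passes is merely consistent with the timing constraints so far.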

¹In all our algorithms, a process always sends the same message (at most one per step) to all processes,
including  itself. 
2.1.1 Omission failures


A process $i$ suffers an omission failure in execution $\alpha$ if and only if there is a computation
step $\pi_j$ of process $i$ in $\alpha$ specifying a set of messages that is a strict subset of the messages
determined by the transition function $S_i$ applied to $st(i, C_{j-1})$. Recall that computation
step $\pi_j$ specifies the messages actually sent by $i$ during that step of execution $\alpha$. Note that
according to our definition of an execution, $st(i, C_j)$ must be the result of applying $S_i$ to
$st(i, C_{j-1})$, regardless of the messages specified by $\pi_j$. This implies that the process itself is
“unaware” of its failure and, unless informed about it, continues executing as if it had not
failed. (This kind of failure is sometimes called a send-omission failure.) If the algorithm
requires $j$ to broadcast a message to all processes, but $j$ does not send a message to $i$, then
we say that “$j$ omits to $i$” or that this broadcast is “unsuccessful”.


2.1.2  Byzantine  failures 

A  process  suffers  a  Byzantine  failure  if  it  changes  its  state  or  sends  messages  in  a  way  not 
specified  by  the  transition  functions  of  the  algorithm.  No  restrictions  are  made  on  its  state 
transitions  or  what  messages  it  sends,  and  so  it  may  exhibit  arbitrary  behavior.  Furthermore, 
the time between successive steps of a faulty process might not be in the interval $[c_1, c_2]$.
The  messages  it  sends,  however,  are  delivered  within  time  d  of  when  they  are  sent. 


2.1.3  Consensus 

Finally,  we  define  the  consensus  problem.  We  assume  that  each  process  begins  with  an 
initial  binary  value  (its  “input”)  as  part  of  its  local  state  and  may  irreversibly  “decide”  on 
a  value  by  entering  a  specially  designated  state.  The  problem  is  for  the  processes  to  agree 
on a binary value despite the failure of some processes. We say that a timed execution $\alpha$ is
$f$-admissible if at most $f$ processes fail in $\alpha$. An algorithm solves the consensus problem for
$f$ failures within time $T$ provided that for each of its $f$-admissible timed executions $\alpha$, (1) no
two different processes decide on different values (agreement), (2) if some nonfaulty process
decides on $v$, then some process has initial value $v$ (validity), and (3) every nonfaulty process
decides by time $T$ (time bound). Note that the validity condition does not imply termination;
termination is implied by the third condition. We consider the binary version of the problem,
where the initial values are 0 or 1. Like the algorithm of [ADLS90], our algorithm for omission
failures can be extended to work for any value set, using the same extension given there
([ADLS90], Section 5.4). Our algorithm for Byzantine failures is a general simulation for
any round-based algorithm and therefore can simulate any synchronous agreement algorithm
for any value set.
Chapter 3


Consensus  in  the  Presence  of 
Omission  Failures 


In  this  chapter,  we  present  a  consensus  algorithm  tolerant  of  send-omission  failures.  The 
algorithm  uses  the  same  strategy  as  that  of  [ADLS90];  we  first  elucidate  this  strategy  by 
describing  a  synchronous  consensus  algorithm  upon  which  it  is  based  and  explaining  our 
algorithm in terms of that synchronous algorithm. For $n \geq 2f+1$, the running time of our
algorithm is $4(f+1)d + Cd$, which is approximately within a factor of 4 of the lower bounds
of $(f-1)d + Cd$ and $(f+1)d$ ([ADLS90]). For $n \leq 2f$, the running time is bounded by two
quantities, $(3[\ldots] + 5)(f+1)d + Cd$ and $(2\sqrt{C} + 6)(f+1)d + Cd$.

In  order  to  motivate  the  work  presented  here,  we  first  discuss  bounds  attainable  by  more 
straightforward  algorithms. 


3.1  Straightforward  upper  bounds 

Attiya, Dwork, Lynch, and Stockmeyer ([ADLS90]) give two simple algorithms tolerant of
stopping failures and with running times of roughly $fCd$. One algorithm is based on a
method  for  simulating  any  synchronous  round-based  algorithm;  the  other  is  specific  to  the 
consensus  problem  and  requires  that  the  processes  begin  synchronized.  Both  algorithms  can 
be  modified  to  tolerate  omission  failures  without  seriously  affecting  the  running  times.  We 
briefly  explain  these  two  simple  algorithms  with  the  modifications. 

The first simple algorithm simulates any synchronous round-based algorithm and takes at
most time $Cd + d$ per round. The algorithm works by executing the round-based algorithm
in parallel with a timeout task. The timeout task is similar to the one described at the
beginning of Chapter 5: each process keeps a count of the number of steps it has taken and
at each step broadcasts the number of its current step to all other processes in the form
“I’m alive: s” at step number $s$. Each process also keeps track of the “I’m alive” messages
received from other processes and detects failures in the expected way, by detecting gaps in
the step numbering or by the absence of messages. (We will in fact employ this strategy in
our algorithm.) While performing the timeout task, a process simulates each round of the
synchronous algorithm by asynchronously executing it: a process simply waits indefinitely
on every other process for either a message of that round or the detection of that process’s
failure. It is not hard to see that this accurately simulates the round-based algorithm: no
process sends a round $r$ message before receiving a round $r-1$ message from all nonfaulty
processes. A simple inductive argument shows that by time $r(Cd + d)$ (more accurately,
time $C(d + c_2) + (d + c_2)$ per round), every process has finished simulating round $r$ of the synchronous
algorithm. Thus, any synchronous consensus algorithm tolerant of omission failures taking
$f+1$ rounds may be directly simulated to yield an algorithm for our semi-synchronous model
that takes time $(f+1)(Cd + d)$.
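
The receiving side of the timeout task might be sketched as follows. The class `AliveMonitor` and its method names are hypothetical, and the step limit is a conservative assumption rather than the constant used in the thesis: consecutive “I’m alive” messages from a live sender arrive at most $d + c_2$ apart, and one extra step is counted because the first counted step may follow the last receipt almost immediately. The sketch also relies on messages being delivered in the order sent.

```python
import math


class AliveMonitor:
    """Watches the "I'm alive: s" messages of one sender (hypothetical helper)."""

    def __init__(self, c1, c2, d):
        # Steps to count so that more than d + c2 real time has surely elapsed.
        self.limit = math.ceil((d + c2) / c1) + 1
        self.expected = 1       # next step number we expect to receive
        self.silent_steps = 0   # local steps taken since the last message
        self.failed = False     # set once a failure of the sender is detected

    def on_message(self, s):
        """Handle delivery of "I'm alive: s"; a gap reveals an omitted message."""
        if s != self.expected:
            self.failed = True
        self.expected = s + 1
        self.silent_steps = 0

    def on_local_step(self):
        """Count one local step; too many silent steps reveal a stopped sender."""
        self.silent_steps += 1
        if self.silent_steps >= self.limit:
            self.failed = True


monitor = AliveMonitor(c1=1.0, c2=2.0, d=5.0)
monitor.on_message(1)
monitor.on_message(2)
monitor.on_message(4)   # message 3 never arrived
print(monitor.failed)
```

Because the timeout is counted in local steps of length up to $c_2$, waiting out a silence of real time about $d$ can take real time about $Cd$, which is the source of the $Cd$ term in the per-round cost.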

Under  the  assumption  that  processes  begin  executing  the  algorithm  at  the  same  time,  a 
simpler  algorithm  specific  to  the  consensus  problem  may  be  used.  This  simpler  algorithm 
does  not  make  use  of  any  fault-detection  mechanisms.  If  a  process  starts  with  initial  value 
1,  it  broadcasts  a  1  and  decides  1  and  halts.  If  a  process  ever  receives  a  1  (and  has  not  yet 
halted), it does the same. It is easy to see that if a correct process receives a 1, then some
correct process receives a 1 by time $fd$ and subsequently all correct processes receive a 1
by time $(f+1)d$ (more accurately, $(f+1)(d + c_2)$). Therefore a process may decide 0 if it
has run for more than $(f+1)(d + c_2)/c_1$ steps without deciding. This takes at most time
approximately $(f+1)Cd$.
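
The chain argument behind the $fd$ and $(f+1)d$ bounds can be traced with a round-by-round sketch, where one round stands for one message delay $d$. The function names, the `reach` representation of send omissions, and the sample scenario are assumptions made for illustration.

```python
import math


def flood_ones(inputs, reach):
    """Trace when each process first holds a 1.

    inputs[p] is p's initial bit. reach[p] is the set of processes that p's
    single broadcast of 1 actually reaches (a correct process reaches all).
    Returns got[p]: the round in which p first holds a 1, or None if never.
    """
    got = [0 if bit == 1 else None for bit in inputs]
    frontier = [p for p, bit in enumerate(inputs) if bit == 1]
    rnd = 0
    while frontier:
        rnd += 1
        newly_informed = []
        for p in frontier:
            for q in reach[p]:
                if got[q] is None:
                    got[q] = rnd
                    newly_informed.append(q)
        frontier = newly_informed
    return got


def steps_before_deciding_zero(f, d, c1, c2):
    """Step count after which a process that has seen no 1 may decide 0."""
    return math.ceil((f + 1) * (d + c2) / c1)


# Worst case for f = 2: faulty processes 0 and 1 each reach only one process.
everyone = set(range(5))
reach = [{1}, {2}, everyone, everyone, everyone]
print(flood_ones([1, 0, 0, 0, 0], reach))
```

In this scenario the first correct process learns of the 1 in round $f$ and every correct process holds it by round $f+1$, matching the two time bounds in the text.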

Finally, we remark that the efficient algorithm of [ADLS90] can be modified to tolerate
omission failures by using the timeout task for omission failures outlined above. The running
time, however, is then roughly $f^2 d + Cd$. This bound follows from a modification of the part
of the analysis of [ADLS90] which takes the sum over each phase $r$ of the number of processes
that fail during the sending of an $r$ message. Because only stopping failures are considered
in [ADLS90], the analysis there concludes that a process may fail during the sending of at
most one $r$ message and therefore the sum over all $r$ is at most $f$. If failures are by omission,
then a process may fail during the sending of many $r$ messages, but only once for any $r$.
Because there are at most $f+2$ phases in any $f$-admissible execution, the sum over all $r$ is
at most $(f+1)f$, resulting in a bound of approximately $(f+1)fd + Cd$.


3.2  Intuition:  the  underlying  synchronous  algorithm 


Our algorithm and the algorithm of [ADLS90] may be interpreted as simulations of an
underlying synchronous algorithm. In this underlying synchronous algorithm, all processes
begin executing in round 0. In even numbered rounds, processes may decide only on 0; in
odd numbered rounds, processes may decide only on 1. In round 0, any process with initial
value 0 decides 0 immediately and broadcasts a message saying "I decided in round 0"; any
process with initial value 1 broadcasts a message saying "I didn't decide in round 0" and
advances to round 1. In any subsequent round r, if a process did not receive a message
saying "I decided in round r - 1", it may decide r mod 2, broadcasting "I decided in round
r"; if it did receive a message saying "I decided in round r - 1", it advances to round r + 1
broadcasting "I didn't decide in round r".

It is easy to see that if a nonfaulty process decides in round r then no process decides
in round r + 1 and all processes then decide in round r + 2. The algorithm is also
"early-stopping": any execution in which at most f processes fail takes at most f + 2 rounds of
communication. (This means that all processes decide in round f + 2 or earlier, despite the
fact that the first round is numbered 0, since a decision in round i is based on messages sent
in round i - 1 or earlier.) This is easily seen by observing that if an execution takes x rounds
then a faulty process decides in each of rounds 0 through x - 3: if no faulty process decides
in round i ≤ x - 3 then either (1) a nonfaulty process decides in round i and all processes
decide by round i + 2, or (2) no process decides in round i and therefore they all decide in
round i + 1 (because no process receives an "I decided in round i" message). Thus, f failures
cause the maximum number of rounds, f + 2, in the following execution. All processes
except some process j_0 begin with initial value 1 and advance to round 1. Process j_0, with
initial value 0, broadcasts "I decided in round 0" to all processes except some other process
j_1. Thus all processes except j_1 advance to round 2; j_1 decides in round 1 and broadcasts
"I decided in round 1" to all processes except some process j_2. This continues until finally
process j_{f-1} decides in round f - 1 and broadcasts "I decided in round f - 1" to all processes
except nonfaulty process j_f, which decides in round f; all processes subsequently decide
in round f + 2.
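
The round structure and the worst-case execution can be checked with a short simulation.
The Python sketch below is an illustration under simplifying assumptions, not the algorithm
as implemented anywhere: the names run_sync and worst_case are hypothetical, rounds are
executed in lockstep, and an omission failure is modeled as a (sender, receiver) pair for
which the sender's "I decided" broadcast is lost.

```python
def run_sync(initial, drops=frozenset()):
    """Simulate the underlying synchronous algorithm.

    initial: list of initial values (0 or 1), indexed by process.
    drops:   set of (sender, receiver) pairs; the sender's "I decided"
             broadcast is omitted to that receiver.
    Returns a dict mapping each process to the round in which it decides;
    the decided value is that round number mod 2.
    """
    n = len(initial)
    # Round 0: every process with initial value 0 decides immediately.
    decided = {p: 0 for p in range(n) if initial[p] == 0}
    r = 0
    while len(decided) < n:
        # Processes that decided in round r announce it (subject to omissions).
        deciders = [p for p in decided if decided[p] == r]
        heard = {q for q in range(n)
                 if any((p, q) not in drops for p in deciders if p != q)}
        r += 1
        # In round r, a process that heard no round r-1 decision decides r mod 2.
        for q in range(n):
            if q not in decided and q not in heard:
                decided[q] = r
    return decided


def worst_case(n, f):
    """Build the f-failure execution described above (needs n >= f + 2)."""
    initial = [0] + [1] * (n - 1)           # only j_0 starts with 0
    drops = {(k, k + 1) for k in range(f)}  # j_k omits its announcement to j_{k+1}
    return initial, drops


initial, drops = worst_case(6, 3)
rounds = run_sync(initial, drops)
print(rounds)
```

For n = 6 and f = 3, the faulty processes j_0, j_1, j_2 decide in rounds 0, 1, 2, the nonfaulty
process j_3 decides in round 3, and the remaining processes decide in round 5 = f + 2.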

Both our algorithm and that of [ADLS90] "simulate" this synchronous algorithm, making
several important optimizations in order to improve the running time for our model. If during
the simulation of round r, a process receives a message saying "I decided in round r - 1",
it immediately advances to round r + 1 (without waiting for round r - 1 messages from
other processes), broadcasting to all processes, in effect, "I know of a process that decided in
round r - 1". Other processes in round r that receive this message relay it to all processes
and also advance immediately to round r + 1. A process may decide in round r only if it
can be sure that no nonfaulty process decided in round r - 1. This is ascertained only when,
for every other process p, either (1) the message "I didn't decide in round r - 1" is received
from p, or (2) p has been detected as faulty (by the timeout protocol), or (3) for some
r' < r - 1, the message "I decided in round r'" has been received from p (also remembered
by the timeout protocol).

The key to the improved efficiency of our algorithm relative to that of [ADLS90] is the
addition of a mechanism for a process to detect its own failure. We require that a process
receive at least n - f acknowledgments for every message of the synchronous algorithm that
it sends. Until a process has received a sufficient number of acknowledgments for its round
r message, it is prohibited from deciding in round r + 1 or advancing to round r + 2. This
is important to the efficiency of the algorithm because it limits to 1 the number of times a
faulty process can omit a message of the synchronous algorithm to all nonfaulty processes.
For n ≥ 2f + 1, the convention of waiting for acknowledgments ensures that a faulty process
does not advance to round r + 1 if it omits to all nonfaulty processes a message saying "I
know of a process that decided in phase r". If it does send such a message to a nonfaulty
process, that nonfaulty process in turn relays it to all other processes; the faulty process
therefore has not delayed the algorithm by very much (time d at most). The convention
of waiting for acknowledgments requires that a process continue executing the algorithm,
sending acknowledgments, after it has decided.


3.3  The  algorithm 

We first explain the presentation of our algorithm. We describe our algorithm as the parallel
composition of a fault-detection protocol and a main algorithm. At each step, a process
first executes the code of the fault-detection protocol, then executes the code of the main
algorithm, and finally sends a message. (Recall that in our model a process may send at
most one message at each step.)

This message is the concatenation of possibly several component "messages" which are
specified by the queue commands in the code: if during a step, the statement "queue 'm'"
is executed in the code, then "message" m is a component of the message sent at the end
of that step. We will refer to a message by any one of its components: we will say "an m
message" or simply "an m" to refer to any message with m as one of its components.

Our model also specifies that a process receives messages only during delivery events (and
therefore only between process steps). For every delivery event, a process changes its state
by adding the received message to a buffer (an unordered set). At its next step, the process
reads and empties this buffer. A conditional statement in the code referring to the receipt
of a message checks whether such a message was read from this buffer during the given step.

For ease of presentation, some components of a process's state are not explicitly named
or maintained in the code: for instance, the number of steps a process has taken, whether it
has decided, or whether it has sent a certain message. Process index subscripts are omitted
in the code but used in the text (e.g., "D_i") to refer to a local variable (D) of process i.


3.3.1  The  fault-detection  protocol 

In order to tolerate omission failures, our algorithm employs the timeout protocol described
in Section 3.1. A process sends a message at every step that it takes, consecutively numbering
all messages that it sends with the number s of its current step.¹ Before a process decides,
the message that it sends at every step is of the form "I'm alive: s", where s is the number
of its current step; after a process decides, the message is of the form "I've decided: s".
The failure of a process can thus be detected by a gap in the sequence numbering (recall
we assume that message links deliver messages in the order sent) or by the absence of any
messages for too long a period of time (more than time d + c_2).

All processes detected as faulty are added to a local set F. When a process i detects the
failure of another process j, it broadcasts this fact in the form of a "shutdown j" message.
Upon receiving this message, other processes add j to their respective sets F; when process
j receives this message, it halts, ceasing its execution of the algorithm. The timeout protocol
also keeps track of which processes have decided. When a process receives a message "I've
decided: s" from another process, it adds that process to its set D. When a process i adds
j to D_i (F_i, resp.), it is said to have "detected" that j has decided (failed, resp.). We say
that a process i is shut down at time t if it receives a "shutdown i" message at time t. The
code for the fault-detection protocol is in Figure 3.1.

We now verify two basic properties of the fault-detection protocol with respect to arbitrary
executions. The first bounds the time by which a failure is detected.

Lemma 3.1 If at time t, process j omits a message to process i, and i is not shut down by
time t + C(d + c_2) + (d + c_2) ≈ t + Cd + d, then i adds j to F_i by that time.

Proof: Let s_j be the step number of j at which it omits a message to i. The lemma is
clearly true if j sends a message to i at a step numbered greater than s_j and that message
arrives at i by time t + C(d + c_2) + (d + c_2). If j does not send such a message, then i receives
no message from j between time t + d and t + d + c_2(1 + (d + c_2)/c_1) = t + (d + c_2) + C(d + c_2),
in which time i takes more than (d + c_2)/c_1 steps and, since it is not yet shut down, adds j
to F_i. ■
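
The equality used in the proof can be expanded step by step. The derivation below is a
worked restatement, assuming C = c_2/c_1 (as the equality in the proof implies) and
d ≫ c_2 for the final approximation.

```latex
\begin{align*}
t + d + c_2\left(1 + \frac{d + c_2}{c_1}\right)
  &= t + d + c_2 + \frac{c_2}{c_1}\,(d + c_2) \\
  &= t + (d + c_2) + C\,(d + c_2) \qquad \text{since } C = c_2/c_1 \\
  &\approx t + Cd + d \qquad \text{when } d \gg c_2 .
\end{align*}
```

The factor c_2(1 + (d + c_2)/c_1) is the longest real time a process can need to take more
than (d + c_2)/c_1 steps, since consecutive steps are at most c_2 apart.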

¹ As a consequence of the bound on running time to be derived, these sequence numbers are bounded by
a function of f, d, c_1 and c_2.
Step s:

  If "shutdown i" received, then halt.

  If decided, then queue "I've decided: s"
  else queue "I'm alive: s".

  For each j ∉ D ∪ F,
    if "shutdown j" message received
      then F ← F ∪ {j}; queue "shutdown j"
    if "I've decided: s_j" message received from j
      then D ← D ∪ {j}
    if "I'm alive" messages from j not numbered consecutively
      then F ← F ∪ {j}; queue "shutdown j"
    if no message received from j and more than
      (d + c_2)/c_1 steps taken since last message received from j,
      then F ← F ∪ {j}; queue "shutdown j".

Figure 3.1: The fault-detection protocol for i at step number s.
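
The protocol of Figure 3.1 can be sketched in executable form. The Python class below is
an illustration only, under several assumptions not taken from the text: the class name
Detector and its fields are hypothetical, each message component is modeled as a
(sender, kind, arg) tuple, steps are numbered from 1, and the silence counter counts steps
(including the current one) at which nothing was read from a given process.

```python
from dataclasses import dataclass, field


@dataclass
class Detector:
    me: int
    peers: list            # ids of all other processes
    timeout_steps: float   # (d + c2) / c1
    decided: bool = False
    halted: bool = False
    F: set = field(default_factory=set)         # processes detected as failed
    D: set = field(default_factory=set)         # processes detected as decided
    last_seq: dict = field(default_factory=dict)
    silent: dict = field(default_factory=dict)  # steps since last message from j

    def step(self, s, inbox):
        """Run one step; inbox holds (sender, kind, arg) components read
        from the buffer. Returns the components queued at this step."""
        if any(k == "shutdown" and a == self.me for _, k, a in inbox):
            self.halted = True
            return []
        out = [("decided" if self.decided else "alive", s)]
        for j in self.peers:
            if j in self.D or j in self.F:
                continue
            from_j = [(k, a) for snd, k, a in inbox if snd == j]
            faulty = any(k == "shutdown" and a == j for _, k, a in inbox)
            if any(k == "decided" for k, _ in from_j):
                self.D.add(j)
            # The buffer is unordered, so sort sequence numbers before checking.
            for seq in sorted(a for k, a in from_j if k == "alive"):
                if seq != self.last_seq.get(j, 0) + 1:
                    faulty = True
                self.last_seq[j] = seq
            self.silent[j] = 0 if from_j else self.silent.get(j, 0) + 1
            if self.silent[j] > self.timeout_steps:
                faulty = True
            if faulty:
                self.F.add(j)
                out.append(("shutdown", j))
        return out


det = Detector(me=0, peers=[1, 2], timeout_steps=3)
first = det.step(1, [(1, "alive", 1), (2, "alive", 1)])
second = det.step(2, [(1, "alive", 3), (2, "decided", 2)])  # gap from process 1
```

In this usage, process 1 skips sequence number 2 and is added to F, while process 2 is
added to D; a later "shutdown 0" component would halt the detector itself.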


The second property verifies that nonfaulty processes are never declared faulty.

Lemma 3.2 If process i does not fail in an execution, then i is not added to any set F_j and
is never shut down.

Proof: For contradiction, let j be the process that first adds i to its failed set F_j. Process j
adds i to F_j because either it receives a "shutdown i" message, or it receives two "I'm alive"
messages from i with a gap in sequence numbering, or it does not receive an "I'm alive"
message from i for more than (d + c_2)/c_1 steps.

By our choice of j, process j cannot receive a "shutdown i" message before adding i to
F_j: that would imply that some other process added i to its failed set before j did.

Because i is nonfaulty (and the links are FIFO), j does not receive two "I'm alive"
messages with a gap in the sequence numbering.

Before it decides, i sends "I'm alive" messages at every step it takes and so any two
messages are delivered to j at most time d + c_2 apart (if one message is delivered immediately
and the following message is delayed by d). In time d + c_2, j can take at most (d + c_2)/c_1
steps and therefore does not add i to F_j. After i decides, it broadcasts an "I've decided"
message, which causes j to add i to D_j and prevents j from adding i to F_j thereafter. Thus,
j cannot add i to F_j. ■
3.3.2 The main algorithm


The main algorithm is basically an asynchronous version of the synchronous algorithm of
Section 3.2. The code for the main algorithm appears in Figure 3.2. We call the simulation
of round r of the synchronous algorithm "phase r". Each process i starts in phase 0 with v_i
set to its own private value (1 or 0). In its first step, a process either decides 0 or advances to
phase 1. As with the synchronous algorithm, in even numbered phases a process can decide
only 0, and in odd numbered phases a process can decide only 1.

When a process advances from phase r to phase r + 1, it broadcasts an "r" message. (This
is the equivalent of the message "I didn't decide in round r" in the synchronous algorithm.)
When a process decides in phase r, it broadcasts an "r + 1" message. (The message "r + 1"
replaces the messages "I decided in round r" and "I didn't decide in round r + 1" of the
synchronous algorithm; both have the meaning "I know of a process that decided in round
r", so it is unnecessary to distinguish between them.) Set M^r contains those processes from
which an r message has been received. A process may decide in phase r only if it has (1) not
yet received an r message, and therefore does not know of a process that decided in round
r - 1, and (2) has received an r - 1 message from all processes not yet detected as faulty or
decided, indicating that they did not decide in round r - 1. If process i is nonfaulty, then
the receipt of an r + 1 message from i prevents other processes from deciding in phase r + 1
since they do not add i to D or F before receiving it. A process that decides in round r does
not send an r message unless it receives one first (this implies that some process decided in
round r - 1 but failed).

Our convention of acknowledging messages works as follows. Each process maintains a
set A^r containing those processes from which a properly sequenced "ack(·, r)" message has
been received. (The restriction to properly sequenced "ack(·, r)" messages is achieved by
not adding a process to A^r if that process is already in F. This restriction is necessary only
for the bound when n ≤ 2f.) Until a process decides, it sends exactly one acknowledgment
message, "ack(j, r')", for each r' message where r' is less than its current phase number. After
a process decides in some phase r, it continues to acknowledge r' messages for r' ≤ r + 1.
This is implemented in the code by allowing the process to advance to phase r + 2 but no
further. It is not necessary for a process to acknowledge r' messages for r' > r + 1 because,
as we will see, if it is nonfaulty then other nonfaulty processes do not advance to phase r + 3
without deciding and therefore do not require acknowledgments for their r + 2 messages.
Until a process has received at least n - f properly sequenced acknowledgments for its r - 1
message (|A^{r-1}| ≥ n - f), it may not advance to phase r + 1 or decide in phase r.

Definition 1 A process i is blocked in phase r (for r > 0) if it advances to phase r without
deciding and never has |A_i^{r-1}| ≥ n - f.


Being blocked is a permanent state, but even if a process is not blocked in phase r, it may
be temporarily delayed from advancing to phase r + 1 as it waits for acknowledgments before
proceeding.

Phase 0: If v = 1, then queue "0" and goto Phase 1.
         If v = 0, then queue "1" and decide 0 and goto Phase 2.

Phase r > 0: For each j and each r', 1 ≤ j ≤ n and 0 ≤ r' ≤ r,
               if "r'" message received from j,
                 then M^{r'} ← M^{r'} ∪ {j}
               if "ack(i, r - 1)" received from j and j ∉ F,
                 then A^{r-1} ← A^{r-1} ∪ {j}
               if j ∈ M^{r'} and r' < r and "ack(j, r')" not yet sent,
                 then queue "ack(j, r')". (whether decided or not)

             If decided and M^{r-2} ≠ ∅ and "r - 2" not yet sent,
               then queue "r - 2"

             If not decided and |A^{r-1}| ≥ n - f, (enough acks received)
               then if M^r ≠ ∅, (some process decided in phase r - 1)
                      then queue "r" and goto Phase r + 1
                    if M^r = ∅ and j ∈ M^{r-1} for all j ∉ (D ∪ F),
                      then queue "r + 1" and decide r mod 2 and goto Phase r + 2

Figure 3.2: The main algorithm of process i, performed at every step. Initially,
a process is in phase 0 with M^r = A^r = ∅ for all r.
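
The final rule of Figure 3.2, which governs an undecided process in phase r > 0, can be
written as a small pure function. The Python sketch below is an illustration only: the
function name phase_action and its signature are hypothetical, processes are assumed to be
numbered 1 through n, and the process is assumed not to require an r - 1 message from
itself. Message bookkeeping, acknowledgments, and the rule for decided processes are
omitted.

```python
def phase_action(r, me, n, f, M, A, D, F, decided=False):
    """Decision rule for process `me` in phase r > 0.

    M[x]: set of processes from which an x message was received.
    A[x]: set of processes whose properly sequenced ack(me, x) was received.
    D, F: processes detected as decided / failed.
    Returns (component to queue, value decided or None, next phase),
    or None if the process must wait in phase r.
    """
    if decided or len(A.get(r - 1, set())) < n - f:
        return None                      # decided already, or too few acks
    if M.get(r):
        return (r, None, r + 1)          # some process decided in phase r-1
    others = set(range(1, n + 1)) - {me} - D - F
    if others <= M.get(r - 1, set()):
        return (r + 1, r % 2, r + 2)     # safe to decide r mod 2
    return None


# Process 1 of n = 3 (f = 1) in phase 1: both others sent "0" and acknowledged.
action = phase_action(1, 1, 3, 1, {0: {2, 3}}, {0: {2, 3}}, set(), set())
```

Here the process has enough acknowledgments, has received no 1 message, and has a 0
message from every process not in D ∪ F, so it queues "2", decides 1, and moves to phase 3.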


We prove here a few basic lemmas about the main algorithm with respect to any
f-admissible execution. The first two lemmas affirm two expected properties that held for the
synchronous algorithm.

Lemma 3.3 If some nonfaulty process decides in phase r ≥ 0 then no process decides in
phase r + 1.

Proof: Let i be a nonfaulty process that decides in phase r and consider any other process j.
According to the code of the main algorithm, j cannot decide in phase r + 1 without receiving
an r message from i or adding i to F_j or D_j. We claim that neither can happen before j
receives an r + 1 message from i, which according to the code (deciding in phase r + 1
requires M_j^{r+1} = ∅) precludes j from deciding in phase r + 1. First, because i is nonfaulty,
by Lemma 3.2 it is never added to F_j. Process i may send an r message, but only after
sending an r + 1 message. Because i is nonfaulty, it does not omit this message, and because
messages are delivered in the order sent, j does not receive it before receiving the r + 1
message. Process i is added to D_j only when j receives the message "I've decided" from i.
By the same argument, j does not receive "I've decided" from i before receiving the r + 1
message. ■

The following definition is useful in proving correctness and analyzing time complexity.

Definition 2 Phase r is quiet if there is some process that never receives any r messages.


Lemma 3.4 If a nonfaulty process decides in phase r ≥ 0 then phase r + 2 is quiet.

Proof: By Lemma 3.3, no process decides in phase r + 1. If a process does not decide in
phase r + 1, then it does not send an r + 2 message until it receives one. Therefore, no
process sends an r + 2 message and in fact no process receives an r + 2 message. ■

The next two lemmas affirm that the convention of acknowledging r messages works as
expected (nonfaulty processes are never blocked) and the last lemma states that the failure
of blocked processes is eventually detected by all processes.

Lemma 3.5 For any process i and any nonfaulty process j, if i advances to phase r ≥ 1
without deciding and sends an r' message to j for 0 ≤ r' ≤ r - 1, then i receives an
"ack(i, r - 1)" message from j.

Proof: By induction on r. Clearly the lemma is true for r = 1: j advances to phase 1
during its first step and sends "ack(i, 0)" during the next step at which it has received a 0
message from i.

Assume the lemma is true for r - 1 ≥ 1. First observe that j does not decide in any
phase r' ≤ r - 3: by Lemma 3.3, this would imply that no process decides in phase r' + 1 and
therefore no process sends an r' + 2 message, but this is not possible because i advances to
phase r ≥ r' + 3 without deciding and therefore must receive an r' + 2 message. If j decides in
phase r' and r' = r - 2 or r - 1, then j immediately advances to phase r' + 2 ≥ r after deciding
and sends "ack(i, r - 1)" to i. Suppose that j does not decide in any phase r' ≤ r - 1. Process
j must advance from each phase r' ≤ r - 1 because it is never shut down, has M_j^{r'} ≠ ∅, and
has |A_j^{r'-1}| ≥ n - f: j is never shut down by Lemma 3.2; j has M_j^{r'} ≠ ∅ because it receives
an r' message from i; j has |A_j^{r'-1}| ≥ n - f because it is nonfaulty and therefore sends an
r'' message to all processes for each r'' ≤ r' - 1 and by the induction hypothesis receives
"ack(j, r'')" from all nonfaulty processes, none of which, by Lemma 3.2, are ever added to
F_j. Process j therefore advances to phase r and may then send "ack(i, r - 1)" to i. ■
Corollary 3.6 If process i is nonfaulty and advances to phase r ≥ 1 without deciding, then
it eventually has |A_i^{r-1}| ≥ n - f. (A nonfaulty process is never blocked.)

Proof: Because i is nonfaulty and advances to phase r without deciding, for 0 ≤ r' < r it
sends an r' message to all processes as it advances to phase r' + 1. By Lemma 3.5, i receives
"ack(i, r - 1)" from each nonfaulty process. Because by Lemma 3.2, nonfaulty processes are
never added to F_i, each nonfaulty process is added to A_i^{r-1}, giving the necessary bound. ■

The following lemma relies on the fact that a process continues to take steps, executing
the algorithm after it decides; in particular, it continues to detect the failure of processes
and, if necessary, send acknowledgments.

Lemma 3.7 If a faulty process j unsuccessfully broadcasts an r message at time t and is
subsequently blocked in phase r + 1, then all processes not shut down by time
t + C(d + c_2) + 2(d + c_2) ≈ t + Cd + 2d detect the failure of j by that time.

Proof: By the definition of being blocked, j advances to phase r + 1 but never has
|A_j^r| ≥ n - f. Thus there is some nonfaulty process i never added to A_j^r. By Lemma 3.5, j omits
an r' message to i for some 0 ≤ r' ≤ r. This omission occurs at or before time t. By
Lemma 3.1, i detects this failure by time t + C(d + c_2) + (d + c_2), broadcasting "shutdown
j" to all processes in the same step. By time d + c_2 later, all processes not yet shut down
have received this message and taken a step, adding j to their failed sets. ■


3.4  Correctness  proof 

We now prove that in all f-admissible executions, the algorithm terminates and correctly
satisfies the agreement and validity conditions. We first prove "progress", that is, that
processes in fact advance to successive phases as expected. Given this progress lemma and a few
simple facts about quiet phases, the proofs of agreement, validity, and termination are easily
derivable. These proofs follow the same reasoning as the informal argument about the
synchronous algorithm outlined in Section 3.2.

Lemma 3.8 For each r ≥ 0 and each process i that is neither blocked nor shut down in any
phase r' ≤ r, process i either decides in some phase r' ≤ r or advances to phase r + 1.

Proof: For contradiction, let phase r be the first phase for which the lemma is not satisfied
and let i be any process for which the lemma is not satisfied at phase r. By the choice of r,
i advances to phase r.
First note that r ≠ 0, since every process either decides or advances to phase 1 during
its first step.

We show below that for r > 0 and for every process j, i either receives an r - 1
message from j or adds j to F_i or D_i. We thus derive a contradiction by concluding that i
may either decide or advance to phase r + 1, since it has j ∈ M_i^{r-1} for all j ∉ (D_i ∪ F_i) and
by assumption is not shut down and eventually has |A_i^{r-1}| ≥ n - f (is not blocked in phase
r).

Let j be any other process. First consider the case that j also is neither shut down nor
blocked in any phase r' < r and that further, j does not fail directly to i. By the choice of
r, j either advances to phase r or decides in a previous phase. If j advances to phase r, then
it must send an r - 1 message to i (successfully, in this case). It cannot be that process j
decides in phase r - 1, since that would imply sending an r message to i, thus enabling i
to advance immediately to phase r + 1, contradicting our original assumption. If j decides
before phase r - 1, then it sends an "I've decided" message to i and is added to D_i.

Now consider the case that j is either shut down or blocked in some phase r' < r or j
fails directly to i. If j is blocked, then by Lemma 3.7, i will eventually detect that j is faulty.
Similarly, if j is shut down, then it halts and i will detect its failure by timeout. Lastly,
Lemma 3.1 ensures that if j fails directly to i and i is not shut down, then i eventually
detects j as faulty and adds it to F_i. ■

Corollary 3.9 For any r > 0, every nonfaulty process either decides in phase r' ≤ r or
advances to phase r + 1.

Proof: By Lemma 3.2 and Corollary 3.6, a nonfaulty process is never shut down or blocked; the
corollary then follows immediately from Lemma 3.8. ■

Corollary 3.10 If phase r > 0 is quiet, then each nonfaulty process decides in some phase
r' ≤ r.

Proof: By Corollary 3.9, each nonfaulty process either decides in phase r' ≤ r or advances
to phase r + 1. But a nonfaulty process cannot advance to phase r + 1: to do so, it would
send an r message to all processes, contradicting the assumption that phase r is quiet. ■

Lemma 3.11 (Agreement) No two nonfaulty processes decide on different values.

Proof: Let r be the first phase in which some nonfaulty process i decides. By Lemma 3.3,
no process decides in phase r + 1. Because no process decides in phase r + 1, no process
sends an r + 2 message and thus phase r + 2 is quiet. Thus by Corollary 3.10, all nonfaulty
processes decide in some phase r' ≤ r + 2. By the choice of r, all nonfaulty processes decide
in either phase r or phase r + 2, in either case on r mod 2. ■


Lemma 3.12 (Validity) If any process decides on value b, then some process i starts with
v_i = b.


Proof: Clearly if some process j decides on 1, it does so in phase r > 0 and that process
itself must have started with v_j = 1 since otherwise it would have decided on 0 during its
first step.

If some process j decides on 0, it cannot be that all processes started with v_i = 1.
For then, no process would decide in phase 0 and no process would send a 1 message. No
process would receive a 1 message and therefore no process would advance to phase 2 without
deciding and so no process would decide 0. ■


Lemma 3.13 (Termination) In any f-admissible execution, there is a quiet phase
numbered at most f + 2 and so each nonfaulty process decides in some phase r ≤ f + 2.

Proof: If some nonfaulty process decides in phase r ≤ f then no process decides in phase
r + 1 and no process sends an r + 2 message. Phase r + 2 is therefore quiet and by
Corollary 3.10 all nonfaulty processes decide by phase r + 2 ≤ f + 2.

If no nonfaulty process decides in any phase r ≤ f, then there must be a phase h,
0 ≤ h ≤ f, in which no faulty process decides, and therefore in which no process decides.
If a process does not decide in phase h, then it does not send an h + 1 message until it
receives one. Therefore no process sends an h + 1 message (phase h + 1 is quiet) and by
Corollary 3.10, all nonfaulty processes decide by phase h + 1 ≤ f + 1. ■
3.5 Analysis of time bounds


We now bound the amount of real time until all nonfaulty processes decide in any
f-admissible execution. The analysis in this section is carried out with respect to any given
f-admissible execution. Having already proved the correctness of the algorithm, we will
hereafter assume d ≫ c_2 and make approximations appropriately. We first establish the tools for
our analysis and then conclude with the nearly optimal bound for n ≥ 2f + 1 (Section 3.5.1)
and two bounds for n ≤ 2f (Section 3.5.2). We first introduce some notation.

• For r ≥ 0, let t_r be the earliest time by which all processes not blocked in any phase
r' ≤ r of the execution have either decided, advanced to phase r + 1, or been shut
down.

Because every process either decides or advances to phase 1 on its first step, t_0 = 0.

• Let phase h be the first (smallest numbered) phase that is quiet.

• For r ≥ 0, let B_r = {i : i is blocked in phase r + 1}; let b_r = |B_r|.

The definition of B_r may seem unusual, but makes sense on closer analysis. We will want
to bound t_r - t_{r-1}, which we think of as the time for phase r, in terms of the number of
processes that omit an r message to all nonfaulty processes. This number is b_r, since all such
processes are subsequently blocked in phase r + 1.

Lemma 3.14 For $r \ne r'$, $B_r \cap B_{r'} = \emptyset$.

Proof: By definition, a process must advance to phase $r'$ in order to be blocked in phase $r'$. If $r < r'$ and $i \in B_r$, then $i$ is blocked in phase $r + 1 \le r'$ and cannot advance to phase $r + 2 \le r' + 1$ or greater. Therefore, $i$ is not blocked in phase $r' + 1$ and cannot be in $B_{r'}$. ■

Corollary 3.15 $\sum_{r} b_r \le f$.

Proof: By Corollary 3.6, a nonfaulty process is not in any $B_r$, so these sets consist of faulty processes only. The bound of $f$ then follows immediately from the disjointness of the sets $B_r$, from Lemma 3.14. ■
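The counting behind Lemma 3.14 and Corollary 3.15 can be illustrated with a minimal Python sketch. The input format (a map from each blocked process to the single phase in which it is blocked) and the function names are assumptions made for illustration; they are not part of the algorithm as specified in the text.

```python
from collections import defaultdict

def blocked_sets(blocked_in):
    """Group processes into sets B_r, where a process blocked in phase r + 1 joins B_r.

    `blocked_in` is a hypothetical input mapping a faulty process id to the
    single phase in which it is blocked (a process is blocked at most once,
    which is exactly why the sets are disjoint).
    """
    sets = defaultdict(set)
    for proc, phase in blocked_in.items():
        sets[phase - 1].add(proc)  # blocked in phase r + 1  =>  member of B_r
    return dict(sets)

def total_blocked(sets):
    """Return the sum of b_r = |B_r| over all r."""
    return sum(len(members) for members in sets.values())

# Example: f = 3 faulty processes, blocked in phases 2, 2 and 4.
B = blocked_sets({"p1": 2, "p2": 2, "p3": 4})
```

Because each process appears under exactly one key, the total can never exceed the number of faulty processes.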

We prove our upper bound by summing the times of the individual phases. We will say "the time of (or for) phase $r$" to mean $t_r - t_{r-1}$. We prove an upper bound for two kinds of phases: those that are quiet and those that are not. We first derive some useful lemmas about the receipt of acknowledgments. We then prove an upper bound on the time to complete any phase (in particular, quiet phases). We then prove a lemma (Lemma 3.19) that is at the heart of the timing analysis, regarding causal chains of $r$ messages. The time for phases that are not quiet depends on whether or not $n \ge 2f + 1$ and will be deferred until the following subsections (Sections 3.5.1 and 3.5.2), where we will also sum over the phases to derive the total time bounds.

We first prove a useful lemma about the timeliness of acknowledgments: if a process receives a sufficient number of properly sequenced acknowledgments for its $r$ message, then it receives them promptly, by time $t_r + 2d$.

Lemma 3.16 For $r > 0$, if process $j$ eventually has $|A_j^r| \ge n - f$, then it has $|A_j^r| \ge n - f$ by time $t_r + 2d$.

Proof: Process $j$ sends an $r$ message either as it advances to phase $r + 1$ or as it decides in phase $r - 1$. If process $j$ broadcasts its $r$ message because it advances to phase $r + 1$, then it is clearly not blocked in any phase $r' \le r$ and is neither decided nor shut down before it broadcasts this message, and so broadcasts it by time $t_r$. Similarly, if $j$ broadcasts its $r$ message because it decides in phase $r - 1$, then it does so by time $t_{r-1}$. In either case, $j$ broadcasts its $r$ message by time $t_r$, and any process that receives an $r$ message from $j$ receives it by time $t_r + d$.

Consider any process $i \in A_j^r$. We claim that $i$ sends "ack($j$, $r$)" by time $t_r + d$. By the fact that it sends "ack($j$, $r$)" eventually, process $i$ must advance to phase $r + 1$ or greater (either by deciding in phase $r - 1$ or phase $r$ or by advancing to phase $r + 1$ without deciding) before sending "ack($j$, $r$)". It follows that $i$ is neither blocked in any phase $r' \le r$ nor shut down before it does so and therefore advances to phase $r + 1$ by time $t_r$. By time $t_r + d$, $i$ also receives an $r$ message from $j$ and therefore sends "ack($j$, $r$)" by then. Process $j$ receives this acknowledgment within a further delay of $d$, that is, by time $t_r + 2d$. ■


Corollary 3.17 For $r > 0$, if process $i$ sends an $r + 1$ message after time $t_r$ or for some $j$ process $i$ sends "ack($j$, $r$)", then $i$ has $|A_i^r| \ge n - f$ by time $t_r + 2d$.

Proof: If process $i$ sends an $r + 1$ message after time $t_r$ then it does not send the $r + 1$ message as a result of deciding in phase $r$, since processes that decide in phase $r$ do so by time $t_r$. Therefore $i$ sends an $r + 1$ message as a result of advancing from phase $r + 1$, which requires that it have $|A_i^r| \ge n - f$. By Lemma 3.16, $i$ therefore has $|A_i^r| \ge n - f$ by time $t_r + 2d$. ■

We  now  prove  a  generous  upper  bound  on  the  time  to  complete  any  phase  (in  particular, 
quiet  phases).  The  proof  is  very  similar  to  the  proof  of  progress. 
Lemma 3.18 $t_1 - t_0 \le Cd + d$ and for any phase $r > 1$,
$t_r \le \max(t_{r-1} + Cd + d,\; t_{r-2} + Cd + 2d)$.

Proof: For contradiction, assume that for $r > 1$ (respectively, $r = 1$) at time $\max(t_{r-1} + Cd + d,\; t_{r-2} + Cd + 2d)$ (resp. time $t_0 + Cd + d$), some process $i$ has neither decided nor advanced to phase $r + 1$ nor been shut down and is never blocked in any phase $r' \le r$. By this time, by the definition of $t_{r-1}$, $i$ is in phase $r$, and since it is not blocked in phase $r$, has $|A_i^{r-1}| \ge n - f$ by Lemma 3.16. We will reach a contradiction by showing that $i$ must decide in phase $r$ by this time because for every other process $j$, either $i$ receives an $r - 1$ message from $j$ or $i$ detects that $j$ has decided or failed ($j \in D_i \cup F_i$).

Let $j$ be any other process. First consider the case that $j$ (1) is not blocked in any phase $r' \le r - 1$, (2) is not shut down by time $t_{r-1}$, and (3) does not fail directly to $i$ before or at time $t_{r-1}$. By Lemma 3.8, $j$ either advances to phase $r$ or decides in some phase $r' \le r - 1$; by definition it does so by time $t_{r-1}$. If $j$ advances to phase $r$, then it sends (successfully, by assumption) an $r - 1$ message to $i$ by time $t_{r-1}$ and $i$ receives this message by time $t_{r-1} + d$. If $j$ decides in phase $r' \le r - 1$, then by time $t_{r'} + d \le t_{r-1} + d$, $i$ receives an "I've decided" message from $j$ and adds $j$ to $D_i$.

Now consider the case that $j$ either (1) is blocked in some phase $r' \le r - 1$, (2) is shut down by time $t_{r-1}$, or (3) fails directly to $i$ at or before time $t_{r-1}$. If $j$ is shut down or fails directly to $i$ at or before time $t_{r-1}$, then by Lemma 3.7, $i$ detects the failure by time $t_{r-1} + Cd + d$. Case (1) is not possible for $r = 1$, so we are finished for that case. If $j$ is blocked in some phase $r' \le r - 1$, then because it advances to phase $r'$, $j$ neither decides nor is blocked nor shut down in any prior phase. Therefore, by time $t_{r'-1} \le t_{r-2}$, $j$ advances to phase $r'$, broadcasting (unsuccessfully) an $r' - 1$ message. By Lemma 3.7, all processes, including $i$, detect the failure of $j$ by time $t_{r-2} + Cd + 2d$. ■
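To see what Lemma 3.18 gives when used on its own, the recurrence can be iterated numerically. The following is a minimal sketch; the function name and the choice to start from $t_0 = 0$ with concrete values of $C$ and $d$ are illustrative assumptions.

```python
def phase_end_upper_bounds(num_phases, C, d):
    """Upper bounds on t_0 .. t_num_phases obtained by iterating Lemma 3.18 alone."""
    t = [0.0]                        # t_0 = 0
    if num_phases >= 1:
        t.append(t[0] + C * d + d)   # t_1 - t_0 <= Cd + d
    for r in range(2, num_phases + 1):
        # t_r <= max(t_{r-1} + Cd + d, t_{r-2} + Cd + 2d)
        t.append(max(t[r - 1] + C * d + d, t[r - 2] + C * d + 2 * d))
    return t

# With C = 4 and d = 1 every phase is charged Cd + d = 5 time units.
bounds = phase_end_upper_bounds(3, 4, 1)
```

Applied to every phase, this generous bound costs about $Cd + d$ per phase, which is why the analysis reserves it for the quiet phase and bounds the other phases separately.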

In bounding the time of a phase $r$ that is not quiet, we will bound the time until every process receives an $r$ message (which every process does, by the definition of a quiet phase). By that time, every process that is not yet decided or shut down or blocked in any phase $r' \le r$ may advance to phase $r + 1$; thus this is a bound for $t_r$. In bounding the time until every process receives an $r$ message, the following reasoning is at the heart of the analysis. In order for the first $r$ message to ever be sent, some process must decide in phase $r - 1$, which by definition it does by time $t_{r-1}$. An $r$ message sent by any other process $i$ that does not decide in phase $r - 1$ is sent because $i$ received an $r$ message. Thus, a causal chain of $r$ messages may be followed and the first $r$ message received by any process can be traced back to a process that originated it ($i_k$ in the following lemma), sending the "first" $r$ message before $t_{r-1}$. Because a process broadcasts an $r$ message as soon as it receives one (at its next step, to be precise; also, assuming it has $|A^{r-1}| \ge n - f$, which it does after $t_{r-1} + 2d$ if at all), our time bound for phases that are not quiet is approximately $d$ times the length of the shortest such chain to each process. We now prove a lemma about the existence of such chains and their basic timing properties. This lemma is central to every bound we will prove for the omission failures algorithm.

Lemma 3.19 If phase $r$ is not quiet, then for every process $i_0$, there exists a sequence of distinct processes $i_0, i_1, \ldots, i_k$ and messages $m_0, m_1, \ldots, m_k$ with $k \ge 0$ such that

(1) for $0 < j \le k$, $i_j$ sends the first $r$ message, $m_{j-1}$, received by $i_{j-1}$,

(2) exactly one process in the sequence, $i_k$, sends an $r$ message by time $t_{r-1}$, and

(3) for $0 < j \le k$, process $i_j$ sends an $r$ message ($m_{j-1}$) by time $t_{r-1} + (k - j + 1)d$.

Proof: Phase $r$ is not quiet, so every process $i_j$ receives an $r$ message; let $m_j$ be the first $r$ message that $i_j$ receives. Define a sequence of processes $i_0, i_1, \ldots$ inductively as follows: if $i_j$ sends an $r$ message by time $t_{r-1}$ then define $k = j$ and let $i_j$ be the last process of the sequence; otherwise, define $i_{j+1}$ to be the process that sends $m_j$.

We first claim that the resulting sequence does not include repetitions and is therefore finite. This is clear if $i_0$ sends an $r$ message by time $t_{r-1}$ (then $k = j = 0$). If not, we show that for any $0 < j \le k$, process $i_j$ is distinct from processes $i_0, \ldots, i_{j-1}$. Only $i_0$ may fail to send an $r$ message. If it does, then clearly it is distinct from the other processes in the sequence; if not, then let $m_{-1}$ be any $r$ message that it sends. If $i_j$ sends an $r$ message by time $t_{r-1}$, then clearly it is distinct and we are done. If not, then for all $i_x$, $0 \le x \le j$, because $i_x$ sends an $r$ message ($m_{x-1}$) later than time $t_{r-1}$, $i_x$ must send it as the result of receiving an $r$ message (by the definition of $t_{r-1}$, a process that decides in phase $r - 1$ broadcasts $r$ by time $t_{r-1}$). It follows that the sending of $m_{x-1}$ by $i_x$ is preceded by the sending of $m_x$, the first $r$ message received by $i_x$. Because a process broadcasts an $r$ message only once, it follows that processes $i_0, \ldots, i_j$ are distinct.

Thus the sequence $i_k, \ldots, i_0$ forms a chain of processes such that for $0 < j \le k$, process $i_j$ sends the first $r$ message, $m_{j-1}$, received by $i_{j-1}$, and $i_k$ is the only process in the sequence to broadcast an $r$ message before time $t_{r-1}$. This proves (1) and (2).

It remains to show (3), the timing property. For $0 < j < k$, the fact that $i_j$ sends an $r$ message but does not decide in phase $r - 1$ implies that $i_j$ advances to phase $r$ by time $t_{r-1}$, since it is not blocked in any phase $r' < r$ and is not decided or shut down before sending $m_{j-1}$, which it does after time $t_{r-1}$. Therefore, by time $t_{r-1} + 2d$, each $i_j$ is in phase $r$ and by Lemma 3.16, has $|A_{i_j}^{r-1}| \ge n - f$. Since $i_{k-1}$ receives $m_{k-1}$ by time $t_{r-1} + d$, it advances to phase $r + 1$, sending $m_{k-2}$ by time $t_{r-1} + 2d$. Process $i_{k-2}$ receives this message by time $t_{r-1} + 3d$ and thus advances to phase $r + 1$, sending $m_{k-3}$ by time $t_{r-1} + 3d$. Similarly, for $0 < j < k$, process $i_j$ receives $m_j$ and sends $m_{j-1}$ by time $t_{r-1} + (1 + k - j)d$. ■
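The inductive construction in the proof of Lemma 3.19 can be sketched in a few lines of Python. This is an illustration only: `first_sender` and `originators` are hypothetical inputs describing one recorded execution, not data structures maintained by the algorithm.

```python
def causal_chain(i0, first_sender, originators):
    """Follow first-received r messages back from i0 to an originator.

    first_sender[p] is the process that sent the first r message p received;
    originators is the set of processes that send an r message by t_{r-1}.
    Both are hypothetical inputs describing one recorded execution.
    """
    chain = [i0]
    while chain[-1] not in originators:
        nxt = first_sender[chain[-1]]
        if nxt in chain:
            raise ValueError("cycle: input does not describe a valid execution")
        chain.append(nxt)
    return chain  # [i_0, i_1, ..., i_k]

def send_deadlines(chain, t_prev, d):
    """Property (3): i_j sends by t_{r-1} + (k - j + 1) d, for 0 < j <= k."""
    k = len(chain) - 1
    return {chain[j]: t_prev + (k - j + 1) * d for j in range(1, k + 1)}

# Example: a received first from b, b from c, c from d; d originated.
chain = causal_chain("a", {"a": "b", "b": "c", "c": "d"}, {"d"})
deadlines = send_deadlines(chain, t_prev=0, d=1)
```

The cycle check mirrors the distinctness argument: in a valid execution no process can reappear, because each process broadcasts its $r$ message only once.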

To complete the lemmas necessary to tightly bound the running time, we need only bound the time for any phase that is not quiet. This bound depends on whether or not $n \ge 2f + 1$.

3.5.1 Bound for $n \ge 2f + 1$


We show that the algorithm depends on $C$ only to the extent of an additive term of $Cd$. For $C$ large, this algorithm may be far more efficient than a direct rounds simulation. The bound we obtain for $n \ge 2f + 1$ is within approximately a factor of 4 of optimal: our bound is $4(f + 1)d + Cd$; the lower bound proved in [ADLS90] is $(f - 1)d + Cd$.

Having bounded the time for quiet phases in Lemma 3.18, we need only bound the time for any phase that is not quiet. If $n \ge 2f + 1$, we can be sure that when a faulty process broadcasts an $r$ message, it either sends to at least one nonfaulty process or becomes blocked in phase $r + 1$, since $f < n - f$. If it sends to a nonfaulty process, then that process will send an $r$ message to all processes and the phase will end. The number of processes blocked in phase $r + 1$ is exactly $b_r$; our bound for phase $r$ is roughly $b_r \cdot d$. This is the key difference between our algorithm and the algorithm of [ADLS90]: a faulty process may cause delay $d$ only if it sends exclusively to other faulty processes; the convention of requiring acknowledgments ensures that each faulty process can do so only once.

To reinforce the intuition about this bound, we first describe how this bound is realized by a worst-case execution: Process $1 \in B_r$ is the first to send an $r$ message. It decides in phase $r - 1$ at time $t_{r-1}$ (no later, by definition of $t_{r-1}$, since process 1 is not blocked in any phase $r' \le r - 1$) and sends an $r$ message to only process $2 \in B_r$. Process 2 waits until time $t_{r-1} + 2d$ for $|A_2^{r-1}| \ge n - f$ and then, having received an $r$ message from 1, advances to phase $r + 1$, sending an $r$ message to only process $3 \in B_r$. The pattern is repeated until process $b_r + 1 \notin B_r$ receives an $r$ message at time $t_{r-1} + (b_r + 1)d$. Process $b_r + 1$ advances to phase $r + 1$ and omits an $r$ message to exactly one nonfaulty process, $i$. All nonfaulty processes except $i$ receive an $r$ message from $b_r + 1$ at time $t_{r-1} + (b_r + 2)d$ and $i$ receives an $r$ message from them at time $t_{r-1} + (b_r + 3)d$. By this time, each process has either advanced to phase $r + 1$ (as it sent an $r$ message), decided, been shut down, or is blocked in some phase $r' \le r$. This scenario shows where the extra $3d$ arises: one $d$ is caused by the delay of waiting (by process 2 in this scenario) for acknowledgments from the previous phase, another $d$ is for a faulty process (here, $b_r + 1$) that is not blocked in phase $r + 1$ to send an $r$ message to a nonfaulty process, and another $d$ is for the remaining nonfaulty processes (here, $i$) to receive an $r$ message. (In [ADLS90], only the last extra $d$ is incurred; this leads to the factor of 2 in their bound, instead of 4 in ours.)
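The timeline of this worst-case scenario can be tabulated with a short Python sketch. The function name and event labels are illustrative assumptions, and the sketch assumes $b_r \ge 2$ so that process 2 is the one that waits for acknowledgments.

```python
def worst_case_phase(b_r, d, t_prev=0):
    """Send/receive times in the worst-case scenario described above.

    Processes 1..b_r are in B_r; process b_r + 1 is faulty but not in B_r.
    Assumes b_r >= 2 so that process 2 is the one waiting for acknowledgments.
    """
    events = {}
    events["process 1 sends"] = t_prev
    events["process 2 sends"] = t_prev + 2 * d   # waited for acks until t_{r-1} + 2d
    for p in range(3, b_r + 2):                  # processes 3 .. b_r + 1
        events[f"process {p} receives"] = t_prev + p * d
    events["nonfaulty (except i) receive"] = t_prev + (b_r + 2) * d
    events["i receives"] = t_prev + (b_r + 3) * d
    return events

# Example with b_r = 3 and d = 1: the phase lasts (b_r + 3) d = 6.
timeline = worst_case_phase(3, 1)
```

The last event lands at $t_{r-1} + (b_r + 3)d$, which is the per-phase bound proved next.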

Lemma 3.20 For $n \ge 2f + 1$ and $r \ge 1$, if for all $r' \le r$ phase $r'$ is not quiet, then
$t_r - t_{r-1} \le (3 + b_r)d$.

Proof: We show that by time $t_{r-1} + (3 + b_r)d$, all processes receive an $r$ message. Thus, by that time, every process is either decided, shut down, blocked in some phase $r' \le r$, or may advance to phase $r + 1$.

By Lemma 3.19, we know that for every process $i_0$, there is a sequence of distinct processes $i_0, \ldots, i_k$ satisfying the three properties of Lemma 3.19.

Now, if $k \le b_r + 2$ then $i_0$ receives $m_0$ by time $t_{r-1} + (k - 1 + 1)d + d \le t_{r-1} + (b_r + 2)d + d$.

If $k > b_r + 2$, then there is a $j$ such that $k - b_r \le j \le k$ and $i_j \notin B_r$. By Lemma 3.19, $i_j$ sends an $r$ message by time $t_{r-1} + (k - j + 1)d \le t_{r-1} + (1 + b_r)d$. Because $i_j \notin B_r$, $i_j$ sends an $r$ message to at least $n - f \ge f + 1$ processes, one of which must be nonfaulty. This nonfaulty process, $i$, receives an $r$ message from $i_j$ by time $t_{r-1} + (2 + b_r)d$.

We now conclude the proof by showing that process $i$ sends an $r$ message by time $t_{r-1} + (2 + b_r)d$; all processes then receive it by time $t_{r-1} + (3 + b_r)d$. Because no phase $r' \le r$ is quiet, it follows from Lemma 3.4 that $i$ does not decide in any phase $r' \le r - 2$. If $i$ decides in phase $r - 1$, then it does so, sending an $r$ message, by time $t_{r-1}$. If $i$ decides in phase $r$ or advances to phase $r$ without deciding, then it does so by time $t_{r-1}$ and subsequently sends an $r$ message once it receives one and has $|A_i^{r-1}| \ge n - f$, which, by Lemma 3.16, it does by time $t_{r-1} + (2 + b_r)d$. ■

We can now bound tightly the running time of any $f$-admissible execution by summing the bounds for all phases in that execution.

Theorem 3.21 For $n \ge 2f + 1$, the algorithm above solves the consensus problem for $f$ omission failures within time $4(f + 1)d + Cd$.

Proof: For any given execution, let $h$ be the first quiet phase. By Lemma 3.10, each nonfaulty process decides in some phase $r \le h$, by time $t_h$. If $h = 0$ then by Lemma 3.10, each nonfaulty process decides in phase 0 in its first step and the running time is 0. If $h = 1$ then by Lemma 3.10, each nonfaulty process decides in phase 1 or 0, and by time $t_1$; by Lemma 3.18 the running time is $t_1 - t_0 \le Cd + d$.

If $h > 1$, then we can bound the time for phases $1, \ldots, h - 1$ by Lemma 3.20, and the time for phase $h$ by Lemma 3.18. Thus we have

\begin{align*}
t_h - t_0 &= \sum_{r=1}^{h-1} (t_r - t_{r-1}) + (t_h - t_{h-1}) \\
&\le \sum_{r=1}^{h-1} (3 + b_r)d + (Cd + d) && \text{(by Lemmas 3.20 and 3.18)} \\
&\le (f + 1)3d + f \cdot d + (Cd + d) && \text{(by Lemma 3.13 and Cor. 3.15)} \\
&= 4(f + 1)d + Cd.
\end{align*}

■
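The summation in this proof can be checked numerically. The sketch below is illustrative: the function names are assumptions, and the caller supplies a list of values $b_1, \ldots, b_{h-1}$ that respects Corollary 3.15 and the limit of at most $f + 1$ non-quiet phases used in the derivation.

```python
def total_time_bound(b, f, C, d):
    """Sum the per-phase bounds for non-quiet phases 1..h-1 plus the quiet phase h.

    b is the list [b_1, ..., b_{h-1}]; the caller must respect
    sum(b) <= f (Corollary 3.15) and len(b) <= f + 1.
    """
    assert sum(b) <= f and len(b) <= f + 1
    non_quiet = sum((3 + b_r) * d for b_r in b)  # Lemma 3.20, one term per phase
    quiet = C * d + d                            # Lemma 3.18 for the quiet phase
    return non_quiet + quiet

def theorem_bound(f, C, d):
    """The closed form 4(f + 1)d + Cd of Theorem 3.21."""
    return 4 * (f + 1) * d + C * d

# Example: f = 3, four non-quiet phases with b = [1, 1, 1, 0], C = 4, d = 1.
summed = total_time_bound([1, 1, 1, 0], f=3, C=4, d=1)
closed = theorem_bound(f=3, C=4, d=1)
```

In this example the two values coincide, which shows that the closed form is exactly what the per-phase bounds add up to when both limits are met with equality.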

For $C \ge 4$, it is possible to construct an execution that takes exactly time $3d + 4(f - 3)d + 3d + Cd + d$. In this execution, the first phase takes time $3d$, the following $f - 3$ phases take time $4d$, the penultimate phase takes $3d$, and the last phase takes time $Cd + d$. Each of the phases taking $4d$ develops when all processes receive an $r - 1$ message at time $t_{r-1}$ and all but one, $p_{r-1}$, advance to phase $r$. Process $p_r$ decides on $r - 1 \bmod 2$ at $t_{r-1}$ (before it receives the $r - 1$ message) and sends an $r$ message to exactly one other process, $p_{r+1}$, which receives its acknowledgments for its $r - 1$ message at time $t_{r-1} + 2d$ and sends an $r$ message to exactly $n - f$ processes. By time $t_{r-1} + 4d$, all processes receive an $r$ message and, except for one process, $p_r$, advance to phase $r + 1$. In the following phase, at time $t_{r-1} + 4d$, process $p_{r+1}$ decides (the processes to which $p_{r+1}$ omitted an $r$ message run slowly and do not detect its failure until $t_{r-1} + 2d + (Cd + d) = t_{r-1} + 7d = t_r + 3d$, so it is not shut down before then). Remaining details are left to the reader.


3.5.2 Bounds for $n \le 2f$

When $n \le 2f$, we are able to bound the running time of the algorithm in two ways, yielding one expression that depends on the ratio $\frac{f}{n-f}$ and another expression that depends on the square root of $C$. We will use Lemmas 3.14 ($B_r \cap B_{r'} = \emptyset$), 3.16 (the timeliness of acknowledgments), 3.18 (the time for any phase), and 3.19 (sequences of causal $r$ messages), and Corollaries 3.15 (the sum of the $b_r$) and 3.17 (also regarding acknowledgments), the proofs of which did not rely on the relative values of $n$ and $f$.

Bound dependent on $\frac{f}{n-f}$

This bound requires a lemma about the length of causal sequences of $r$ messages more complicated than Lemma 3.19. Processes not in $B_r$ must send an $r$ message to $n - f$ processes but not necessarily to a nonfaulty process. We therefore are not able to argue as for $n \ge 2f + 1$ that phase $r$ ends very soon after a process not in $B_r$ sends an $r$ message. Nevertheless, disregarding processes in $B_r$ for the moment, if it were true that a process could not get an acknowledgment from another process that already sent an $r$ message, then it would take at most time $(\frac{f}{n-f})d$ before a nonfaulty process received an $r$ message. Our algorithm does not exactly enforce this restriction on acknowledgments, but it does prevent a process from using acknowledgments received from a process that previously omitted an $r$ message to it. We are thus able to derive a bound of $(\frac{3f}{n-f} + b_r + 4)d$ below in Lemma 3.23. This argument is most easily made by considering a directed graph on the faulty processes. Accordingly, for a given execution, define

• directed graph $G_f^r = (V_f, E_f^r)$ where

$V_f = \{\text{all processes that fail during the given execution}\}$,

$E_f^r = \{(i, j) : i \text{ sends an } r \text{ message to } j;\ i, j \in V_f;\ i \ne j\}$.

• $\delta_f^r(i, j)$ = length of the shortest path in $G_f^r$ from $i$ to $j$, where $i, j \in V_f$.

• $S_r^{t_{r-1}} = \{i : i \text{ sends an } r \text{ message by time } t_{r-1}\}$.

• $S_r^{\mathrm{nf}} = \{i : i \text{ sends an } r \text{ message to a nonfaulty process}\}$.

Claim 3.22 If phase $r$ is not quiet and no nonfaulty process decides in phase $r - 1$, then there exist faulty processes $\alpha \in S_r^{t_{r-1}}$ and $\gamma \in S_r^{\mathrm{nf}}$ such that there is a path in $G_f^r$ from $\alpha$ to $\gamma$.

Proof: Let $\gamma$ be the first process to send an $r$ message to a nonfaulty process. Process $\gamma$ must be faulty: by the choice of $\gamma$, no process sends an $r$ message to a nonfaulty process earlier than $\gamma$ sends an $r$ message, and therefore if $\gamma$ is nonfaulty then no process sends an $r$ message to $\gamma$ before it sends its $r$ message. Therefore $\gamma$ must decide in phase $r - 1$, contradicting our assumption that no nonfaulty process decides in phase $r - 1$. We conclude that $\gamma$ is a faulty process in $S_r^{\mathrm{nf}}$. Note that $\gamma$ sends an $r$ message before any nonfaulty process does.

Let $C_f$ be the nodes in $G_f^r$ from which node $\gamma$ is reachable (including $\gamma$) and let $\alpha$ be a process in $C_f$ such that no process in $C_f$ sends an $r$ message before $\alpha$ does. It follows that $\alpha$ sends an $r$ message no later than $\gamma$ does. Because no nonfaulty process sends an $r$ message before $\gamma$ does, $\alpha$ does not receive an $r$ message from a nonfaulty process before sending its $r$ message. By choice, $\alpha$ does not receive an $r$ message from any faulty process before it sends its $r$ message. We therefore conclude that $\alpha$ receives no $r$ messages before sending its own, and therefore must decide in phase $r - 1$, sending an $r$ message by time $t_{r-1}$ (by its definition). ■

Lemma 3.23 For $r \ge 1$, if for all $r' \le r$ phase $r'$ is not quiet, then

$t_r - t_{r-1} \le (\frac{3f}{n-f} + b_r + 4)d$.

Proof: We show that by time $t_{r-1} + (\frac{3f}{n-f} + b_r + 4)d$, every process receives an $r$ message. Thus, by that time, every process that is never blocked in any phase $r' \le r$ and is neither decided nor shut down at that time has $|A^{r-1}| \ge n - f$ by Lemma 3.16 (because it is not blocked in phase $r$) and therefore may advance to phase $r + 1$.

First note that if a nonfaulty process decides in phase $r - 1$, it does so by time $t_{r-1}$, broadcasting an $r$ message that is subsequently received by all processes by time $t_{r-1} + d$, and the lemma is proved. So we consider the case that no nonfaulty process decides in phase $r - 1$.

Claim 3.22 applies for this case: it says that there exist processes $i \in S_r^{t_{r-1}} \cap V_f$ and $j \in S_r^{\mathrm{nf}} \cap V_f$ such that $j$ is reachable from $i$ in $G_f^r$. Let $\alpha \in S_r^{t_{r-1}}$ and $\gamma \in S_r^{\mathrm{nf}}$ be a closest pair of nodes in $G_f^r$:

$$\delta_f^r(\alpha, \gamma) = \min_{i \in S_r^{t_{r-1}} \cap V_f,\ j \in S_r^{\mathrm{nf}} \cap V_f} \delta_f^r(i, j).$$

We first bound the time by which $\gamma$ broadcasts its $r$ message. Fix a minimal length path from $\alpha$ to $\gamma$ and let $\beta$ be the last process on that path that sends an $r$ message by time $t_{r-1} + 2d$. We claim that $\gamma$ broadcasts its $r$ message by time $t_{r-1} + (\delta_f^r(\beta, \gamma) + 2)d$. By the choice of $\beta$, each process $j$ on the path from $\beta$ to $\gamma$ sends an $r$ message later than time $t_{r-1} + 2d$ and therefore by Corollary 3.17 has $|A_j^{r-1}| \ge n - f$ by time $t_{r-1} + 2d$. By time $t_{r-1} + 3d$, the process on this path after $\beta$ receives an $r$ message (from $\beta$) and thus broadcasts an $r$ message. Similarly, for $j \ge 1$, the $j$-th process on the path after $\beta$ sends an $r$ message by time $t_{r-1} + 2d + jd$ and process $\gamma$ sends an $r$ message by time $t_{r-1} + (2 + \delta_f^r(\beta, \gamma))d$.

We now show that by time $t_{r-1} + (\delta_f^r(\beta, \gamma) + 4)d$, all processes receive an $r$ message. A nonfaulty process $\eta$ receives an $r$ message from $\gamma$ by $t_{r-1} + (3 + \delta_f^r(\beta, \gamma))d$. Because no phase $r' \le r$ is quiet, it follows from Lemma 3.4 that $\eta$ does not decide in any phase $r' \le r - 2$. If $\eta$ decides in phase $r - 1$, then it does so by time $t_{r-1}$, sending an $r$ message as it does; all processes receive it by time $t_{r-1} + d$. If $\eta$ advances to phase $r$ without deciding, then it does so by time $t_{r-1}$. By Corollaries 3.6 and 3.17, $\eta$ has $|A_\eta^{r-1}| \ge n - f$ by time $t_{r-1} + (3 + \delta_f^r(\beta, \gamma))d$. By this time, $\eta$ has received an $r$ message from $\gamma$ and therefore if $\eta$ has not yet sent an $r$ message (that is, if it has not yet advanced from phase $r$, or has decided in phase $r$ and advanced to phase $r + 2$ but not yet sent an $r$ message) it may then send an $r$ message. An $r$ message is then received by all processes by time $t_{r-1} + (\delta_f^r(\beta, \gamma) + 4)d$.

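To complete the proof, we now show $\delta_f^r(\beta, \gamma) \le \frac{3f}{n-f} + b_r$. Let $k = \delta_f^r(\beta, \gamma)$ and let $L_i = \{p : \delta_f^r(\beta, p) = i\}$ for $1 \le i \le k$. Define $L_0 = \{\beta\}$ and $L_{-1} = \emptyset$. Consider the sum

$$\sigma = \sum_{i=0}^{k-1} |L_{i-1} \cup L_i \cup L_{i+1}|.$$

Since the sets $L_i$ are disjoint, each node in $G_f^r$ is counted at most 3 times, so $\sigma \le 3|V_f| \le 3f$.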
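The layer sets $L_i$ and the sum $\sigma$ can be computed for a concrete graph with a breadth-first search. The sketch below is illustrative only: the adjacency-dictionary representation and the small example graph are assumptions, not taken from the text.

```python
from collections import deque

def bfs_layers(graph, beta):
    """Return layers L_0, L_1, ... of nodes by shortest-path distance from beta."""
    dist = {beta: 0}
    queue = deque([beta])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = [set() for _ in range(max(dist.values()) + 1)]
    for node, distance in dist.items():
        layers[distance].add(node)
    return layers

def sigma(layers, k):
    """Sum over i = 0..k-1 of |L_{i-1} U L_i U L_{i+1}|, with L_{-1} empty."""
    def layer(i):
        return layers[i] if 0 <= i < len(layers) else set()
    return sum(len(layer(i - 1) | layer(i) | layer(i + 1)) for i in range(k))

# Hypothetical graph on five faulty processes; "g" is at distance k = 3 from "b".
graph = {"b": ["x", "y"], "x": ["z"], "y": ["z"], "z": ["g"]}
layers = bfs_layers(graph, "b")
total = sigma(layers, k=3)
```

Here the sum is 11, below the limit of 3 times the 5 nodes, because each node lies in exactly one layer and each layer appears in at most three of the unions.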

Claim 3.24 For $0 \le i \le k - 1$, at least $k - b_r$ of the sets $L_{i-1} \cup L_i \cup L_{i+1}$ have cardinality at least $n - f$.

Proof: Clearly, for $i \le k - 1$, no set $L_i$ is empty, since $\gamma$, at distance $k$ from $\beta$, receives an $r$ message from a faulty process. At least $k - b_r$ sets $L_i$ contain a process $j \notin B_r$ such that $j$ is on the path from $\beta$ to $\gamma$. For each such $j$, and each process $\ell$ in $A_j^r$, clearly, $j$ sends $\ell$ an $r$ message; we will show that if $j$ is on the chosen path from $\beta$ to $\gamma$, then process $\ell$ sends $j$ an $r$ message also. We will also show that if $j \in L_i$ where $i \le k - 1$, then $\ell$ is faulty and therefore in $G_f^r$. Thus, for all $\ell$ in $A_j^r$ such that $j \in L_i$, there are edges of $G_f^r$ in both directions between $j$ and $\ell$, and if $j \notin B_r$, then $|L_{i-1} \cup L_i \cup L_{i+1}| \ge n - f$, completing the proof.

We first show that if $j \in L_i$ where $i \le k - 1$, then $\ell \in A_j^r$ is faulty. If $\ell$ were nonfaulty, then $j$ would be in $S_r^{\mathrm{nf}}$. But $j$ cannot be in $S_r^{\mathrm{nf}}$, since $\gamma$, at distance $k$, was defined to be the closest node to $\alpha$ in $S_r^{\mathrm{nf}} \cap V_f$ but $j \in L_i$ is at distance $i < k$.




We next show that for each $j$ on the chosen path from $\beta$ to $\gamma$, if $\ell \in A_j^r$ then $\ell$ sends $j$ an $r$ message. Consider first the case that $\ell$ broadcasts an $r$ message before sending "ack($j$, $r$)" to $j$. Since $\ell \in A_j^r$, $\ell$ does not omit a message to $j$ before sending "ack($j$, $r$)"; in particular, it does not omit the $r$ message. Consider then the case that $\ell$ does not send an $r$ message before sending "ack($j$, $r$)" to $j$. By the choice of $\beta$, $j$ sends its $r$ message (and $\ell$ receives it) later than time $t_{r-1} + 2d$. Because $\ell$ sends an "ack($j$, $r$)" message, $\ell$ either decides in phase $r - 1$ or advances to phase $r$ without deciding. However, $\ell$ cannot decide in phase $r - 1$: processes that decide in phase $r - 1$ do so, sending an $r$ message by time $t_{r-1}$, but we are assuming $\ell$ does not send an $r$ message before sending "ack($j$, $r$)", which is later than time $t_{r-1} + 2d$. Thus, at the time that $\ell$ receives the $r$ message from $j$, $\ell$ is either in phase $r$, not yet having sent an $r$ message, or decided and in phase $r + 2$, not yet having sent an $r$ message. In its first step after receiving the $r$ message from $j$, process $\ell$ queues both "$r$" and "ack($j$, $r$)". Because $\ell \in A_j^r$, this message is not omitted, so $\ell$ sends an $r$ message to $j$. ■

Thus we have $(k - b_r)(n - f) \le \sigma$ and $\sigma \le 3f$, or $\delta_f^r(\beta, \gamma) = k \le \frac{3f}{n-f} + b_r$, which completes the proof; all processes receive an $r$ message by time $t_{r-1} + (\frac{3f}{n-f} + b_r + 4)d$. ■


Theorem 3.25 For $n \le 2f$, the algorithm above solves the consensus problem for $f$ omission failures within time $(\frac{3f}{n-f} + 5)(f + 1)d + Cd$.

Proof: For any given execution of the algorithm in which $h$ is the first quiet phase, by Lemma 3.10, each nonfaulty process decides in some phase $r \le h$, by time $t_h$. Again, if $h = 0$ then by Lemma 3.10, each nonfaulty process decides in phase 0 in its first step and the running time is 0. If $h = 1$ then by Lemma 3.10, each nonfaulty process decides in phase 1 or 0, and by time $t_1$; by Lemma 3.18 the running time is $t_1 - t_0 \le Cd + d$.

If $h > 1$, we can bound the time for phases $1, \ldots, h - 1$ by Lemma 3.23, and the time for phase $h$ by Lemma 3.18. Thus we have

\begin{align*}
t_h - t_0 &= \sum_{r=1}^{h-1} (t_r - t_{r-1}) + (t_h - t_{h-1}) \\
&\le \sum_{r=1}^{h-1} \left(4 + \frac{3f}{n-f} + b_r\right)d + (Cd + d) && \text{(by Lemmas 3.23 and 3.18)} \\
&\le (f + 1)\left(4 + \frac{3f}{n-f}\right)d + f \cdot d + (Cd + d) && \text{(by Lemma 3.13 and Cor. 3.15)} \\
&= (f + 1)\left(5 + \frac{3f}{n-f}\right)d + Cd.
\end{align*}

Bound dependent on $\sqrt{C}$

The analysis of the previous section shows that the running time of our algorithm is

$$\left(5 + \frac{3f}{n-f}\right)(f + 1)d + Cd.$$

Note that if $f$ is close to $n$, say $n = f + 1$, then the bound is roughly proportional to $f^2 d$, which is no improvement on the algorithm of [ADLS90]. However, for these proportional values of $n$ and $f$, we are better able to bound the running time. The analysis of this section shows that the running time of our algorithm is also bounded by

$$(2\sqrt{C} + 6)(f + 1)d + Cd.$$

So when $2\sqrt{C} + 6 < 5 + \frac{3f}{n-f}$, or roughly $n < f + f/\sqrt{C}$, we have a better bound on the running time.
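The two bounds can be compared numerically for concrete parameters. The following Python sketch simply evaluates the two closed forms quoted above; the function names and the sample values of $n$, $f$, $C$ and $d$ are illustrative assumptions.

```python
import math

def ratio_bound(n, f, C, d):
    """(5 + 3f/(n-f)) (f+1) d + C d, the bound of Theorem 3.25."""
    return (5 + 3 * f / (n - f)) * (f + 1) * d + C * d

def sqrt_bound(f, C, d):
    """(2 sqrt(C) + 6) (f+1) d + C d, the bound stated for this section."""
    return (2 * math.sqrt(C) + 6) * (f + 1) * d + C * d

def sqrt_bound_is_better(n, f, C):
    """True when 2 sqrt(C) + 6 < 5 + 3f/(n-f)."""
    return 2 * math.sqrt(C) + 6 < 5 + 3 * f / (n - f)

# n = f + 1 = 11 with C = 16: the ratio bound grows like f^2, the other does not.
by_ratio = ratio_bound(n=11, f=10, C=16, d=1)
by_sqrt = sqrt_bound(f=10, C=16, d=1)
```

For $n = 2f$ the ratio term is only 3, so the first bound wins; the second bound matters when $n - f$ is small relative to $f/\sqrt{C}$.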

Define a partition of the first $r$ phases of the given execution into two classes according to their length:

$X_r = \{p : t_p - t_{p-1} \le \sqrt{C} \cdot d \text{ and } p \le r\}$ = {short phases}

$Y_r = \{p : p \notin X_r \text{ and } p \le r\}$ = {long phases}

and define

$S_r^f$ = {$i$ : $i$ omits an $r$ message to a nonfaulty process after time …}

We can bound the short phases by their defined bound, but bound the long phases by chains of $r$ messages, via the following two lemmas.

Lemma 3.26 If phase r ≥ 1 is not quiet then either t_r ≤ t_{r−1} + (|S_r| + 3)d or
all nonfaulty processes decide by this time.

Proof: We once again show that by time t_{r−1} + (|S_r| + 3)d, all processes receive an r message
and thus by this time are either decided, shut down, blocked in some phase r' < r, or may
advance to phase r + 1.

By Lemma 3.19, we know that for any process i_0, there is a sequence of distinct processes
i_0, i_1, ..., i_k such that i_k is the only process in the sequence to broadcast an r message before
time t_{r−1}, and for 0 < j ≤ k, by time t_{r−1} + (1 + k − j)d, i_j sends the first r message
received by i_{j−1}.

Now, if k ≤ |S_r| then i_0 receives its first r message by time t_{r−1} + (k − 1 + 1)d + d ≤ t_{r−1} + (|S_r| + 1)d. If
k > |S_r|, then there is a j such that 0 < k − |S_r| ≤ j ≤ k and i_j ∉ S_r. Process i_j therefore
sends an r message to all nonfaulty processes by time t_{r−1} + (1 + k − j)d ≤ t_{r−1} + (1 + |S_r|)d.
If some nonfaulty process has not yet decided, then it sends an r message to all other
processes by time t_{r−1} + (2 + |S_r|)d and process i_0 receives an r message by time t_{r−1} + (3 +
|S_r|)d. ■

The key observation is that a process cannot fail to a nonfaulty process in many long
phases:

Lemma 3.27 For any execution of the protocol taking at least φ phases and for any process
j, there are at most √C + 3 phases p ∈ Y_φ such that j ∈ S_p.

Proof: If j omits a k message to a nonfaulty process at time t then by Lemma 3.1, that
nonfaulty process detects j's failure by time t + Cd + d, broadcasting “shutdown j” at that
time. We have t ≤ t_k, and so j is shut down by time t + Cd + 2d ≤ t_k + (√C + 2)√C d. It
follows that there are at most √C + 2 long phases ℓ such that t_k < t_ℓ ≤ t_k + Cd + 2d. Thus j
cannot attempt to send (and cannot omit) an ℓ + 1 message after time t_ℓ and is therefore
not in S_p for any p > ℓ. ■
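The counting step in the proof above can be spelled out as follows. This is a sketch under two assumptions: that C ≥ 1 (as holds when C is the ratio of the step-time bounds), and that every long phase lasts at least √C · d, so that the end times of distinct long phases are at least √C · d apart.

```latex
\[
Cd + 2d \;\le\; Cd + 2\sqrt{C}\,d \;=\; (\sqrt{C} + 2)\sqrt{C}\,d ,
\]
\[
\#\{\ell \text{ long} : t_k < t_\ell \le t_k + Cd + 2d\}
\;\le\; \frac{Cd + 2d}{\sqrt{C}\,d}
\;=\; \sqrt{C} + \frac{2}{\sqrt{C}}
\;\le\; \sqrt{C} + 2 .
\]
```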


Theorem 3.28 For n ≤ 2f, the algorithm above solves the consensus problem for f omission
failures within time (2√C + 6)(f + 1)d + Cd.

Proof: Let phase h be the first quiet phase. Again, if h = 0 then by Lemma 3.10, each
nonfaulty process decides in phase 0 in its first step and the running time is 0. If h = 1 then
by Lemma 3.10, each nonfaulty process decides in phase 1 or 0, by time t_1; by Lemma 3.18
the running time is t_1 − t_0 ≤ Cd + d.

If h > 1, we consider two cases. Consider first the case that not all nonfaulty processes
decide in phase h − 2. We bound the length of the short phases by their defined length.
We bound the length of the long phases by Lemma 3.26 and then sum the sizes of the sets S_p using
Lemma 3.27. The length of phase h is bounded by Lemma 3.18. Thus we have

\begin{align*}
t_h - t_0 &= \sum_{p \in X_{h-1}} (t_p - t_{p-1}) + \sum_{p \in Y_{h-1}} (t_p - t_{p-1}) + (t_h - t_{h-1}) \\
&\le |X_{h-1}| \cdot \sqrt{C}\,d + \sum_{p \in Y_{h-1}} (3 + |S_p|)\,d + (Cd + d) && \text{(by Lemmas 3.26 and 3.18)} \\
&\le |X_{h-1}| \cdot \sqrt{C}\,d + 3|Y_{h-1}|\,d + f(\sqrt{C} + 3)\,d + (Cd + d) && \text{(by Lemma 3.27)} \\
&\le (f+1)\sqrt{C}\,d + 3(f+1)\,d + f(\sqrt{C} + 3)\,d + Cd + d && \text{(by Lemma 3.13)} \\
&\le (2\sqrt{C} + 6)(f+1)\,d + Cd.
\end{align*}

Now consider the case that all nonfaulty processes decide in phase h − 2. The running time is
then bounded by t_{h−2} − t_0:

\begin{align*}
t_{h-2} - t_0 &= \sum_{p \in X_{h-2}} (t_p - t_{p-1}) + \sum_{p \in Y_{h-2}} (t_p - t_{p-1}) \\
&\le |X_{h-2}| \cdot \sqrt{C}\,d + \sum_{p \in Y_{h-2}} (3 + |S_p|)\,d && \text{(by Lemma 3.26)} \\
&\le |X_{h-2}| \cdot \sqrt{C}\,d + 3|Y_{h-2}|\,d + f(\sqrt{C} + 3)\,d && \text{(by Lemma 3.27)} \\
&\le f\sqrt{C}\,d + 3fd + f\sqrt{C}\,d + 3fd && \text{(by Lemma 3.13)} \\
&= (2\sqrt{C} + 6)\,fd.
\end{align*}
Chapter 4

Consensus in the Presence of
Byzantine Failures


In this chapter we present a simulation algorithm using 3f + 1 processes and tolerating f
Byzantine failures. The algorithm simulates any synchronous round-based algorithm tolerant
of f Byzantine failures and uses time r(d + 2Cd) + Cd, where r is the number of rounds
required by the synchronous algorithm.

The simulation works by keeping processes loosely synchronized to ensure that a nonfaulty
process does not advance to round r until it has received a round r − 1 message from every
nonfaulty process. The partial synchronization works by using a combination of two criteria
for advancing to further phases, one based on elapsed local time and the other based on
messages received. A similar technique is used in [WL88] to initiate new rounds of clock
resynchronization. In particular, our criterion for ending round 1 is essentially the same as
the criterion used in [WL88] for ending every round; our criterion for subsequent rounds is
different.


4.1  The  simulation  algorithm 

The algorithm simulates a synchronous algorithm by ensuring that each nonfaulty process
receives all round r messages of the synchronous algorithm from all other nonfaulty processes
before advancing to round r + 1. We do not explore here the formal semantics of “a correct
simulation”; rather we regard as sufficient the following correspondence ensured by the above
property: For every execution of the simulation, there is an execution of the round-based
synchronous algorithm in which the nonfaulty processes receive the same vector of messages
from each other at each round. Since the behavior of faulty processes is not restricted, clearly
the corresponding synchronous execution is legal.

Therefore, for the purposes of simulation, we define a synchronous algorithm by its message
function only, suppressing information about the state of the synchronous algorithm.
Let M_i(r, V^{r−1}) denote the vector of messages to be sent in the synchronous algorithm by
process i in round r when the ordered set of messages V^{r−1} is received in round r − 1 (of
course, this message function may also depend on the state of the process; we leave this
implicit). Without loss of generality, assume each process sends a message to all processes
at every round of the synchronous algorithm.

Recall that we assume all processes begin executing the algorithm at the same time. At
each step, a process increments a counter s (initially 0) and executes the code in Figure 4.1,
explained below. A local variable ROUND, initially 1, keeps track of the round number. Ordered
set V^r contains the round r message received from each process. We refer to the message sent
by a process in round r as a “round r” message. (Recall that we assume each process sends a message
to all processes in every round of the synchronous algorithm.)

Each process first sends its round 1 message and then waits for at least time d to ensure
that it receives a round 1 message from every other nonfaulty process. When it can be sure
that time d has elapsed, it advances to round 2 and broadcasts its round 2 message based
on the round 1 messages it has received so far. It ensures that time d has passed by either
waiting for d/c_1 of its own steps or by receiving f + 1 round 2 messages; the latter ensures that
some nonfaulty process has waited at least time d.

In each subsequent round r, a process waits for at least time 2d (actually 2d + 3c_2) after
at least f + 1 nonfaulty processes have sent a round r message. By this time, all nonfaulty
processes must have received at least f + 1 round r messages and therefore advanced to
round r and sent a round r message. At this time, a process advances to round r + 1 and
broadcasts its round r + 1 message. Again, there are two ways for a process to deduce
that sufficient time has passed: if it takes (2d + 3c_2)/c_1 steps after receiving at least 2f + 1
round r messages or if it receives at least f + 1 round r + 1 messages. The latter ensures
that some nonfaulty process has advanced to round r + 1 and therefore has already waited
a sufficient amount of time (at least time 2d after at least f + 1 nonfaulty processes sent a
round r message).
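The advancement rules just described can be sketched as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class `SimProcess`, its field names, the `deliver` helper and the `message_fn` callback are hypothetical, message transport is left to the caller, and the strict comparisons on the step counter follow the reading "more than d/c_1 steps".

```python
from dataclasses import dataclass, field

SEND, WAIT_MSGS, WAIT_TIME = 0, 1, 2  # "Round r", "Round r'", "Round r''"


@dataclass
class SimProcess:
    f: int       # number of Byzantine failures tolerated
    d: float     # upper bound on message delay
    c1: float    # minimum time between two steps
    c2: float    # maximum time between two steps
    round_no: int = 1
    stage: int = SEND
    s: int = 0   # step counter
    received: dict = field(default_factory=dict)  # round -> {sender: message}
    outbox: list = field(default_factory=list)    # (round, message) to broadcast

    def deliver(self, rnd, sender, msg):
        """Record a round-rnd message from sender (one slot per sender)."""
        self.received.setdefault(rnd, {})[sender] = msg

    def count(self, rnd):
        return len(self.received.get(rnd, {}))

    def step(self, message_fn):
        """One local step: increment s, then run the code for the current stage."""
        self.s += 1
        r = self.round_no
        if self.stage == SEND:
            # Broadcast the round r message computed from the round r-1 messages.
            self.outbox.append((r, message_fn(r, self.received.get(r - 1, {}))))
            self.stage = WAIT_MSGS
        elif self.stage == WAIT_MSGS and r == 1:
            # Round 1': enough local steps, or f+1 round 2 messages seen.
            if self.s > self.d / self.c1 or self.count(2) >= self.f + 1:
                self.round_no, self.stage = 2, SEND
        elif self.stage == WAIT_MSGS:
            # Round r': wait for 2f+1 round r messages, then restart the counter.
            if self.count(r) >= 2 * self.f + 1:
                self.s = 0
                self.stage = WAIT_TIME
        else:
            # Round r'': enough local steps, or f+1 round r+1 messages seen.
            enough_steps = self.s > (2 * self.d + 3 * self.c2) / self.c1
            if enough_steps or self.count(r + 1) >= self.f + 1:
                self.round_no, self.stage = r + 1, SEND
```

A driver would call `step` once per local step, broadcast whatever appears in `outbox`, and call `deliver` for each arriving message. Keeping one slot per sender in `received` means a faulty process cannot be counted twice toward the f + 1 or 2f + 1 thresholds.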


4.2  Correctness 

Let t_r be the latest time that any nonfaulty process sends a round r message. Again, we
assume that all processes begin at the same time (here, t_1). We say a process “advances
to round r” when it executes the “goto ROUND r” statement in the code. In order to
prove correctness, we must show that a nonfaulty process eventually advances to all rounds
required by the synchronous algorithm and always receives a round r message from all
nonfaulty processes before advancing to round r + 1.

Round 1    Send the round 1 message; goto Round 1'.

Round 1'   If s > d/c_1 or |V^2| ≥ f + 1,
           then goto Round 2.

Round r    Send M(r, V^{r−1}); goto Round r'.

Round r'   If |V^r| ≥ 2f + 1,
           then s ← 0; goto Round r''.

Round r''  If s > (2d + 3c_2)/c_1 or |V^{r+1}| ≥ f + 1,
           then goto Round r + 1.

Figure 4.1: The simulation of a synchronous algorithm. At each step, a process
increments the counter s and executes the code according to its
present round number. V^r is the ordered set consisting of the
round r message received from each process. M(r, V^{r−1}) denotes the
message function of the synchronous algorithm for round r.

Lemma 4.1 Each nonfaulty process advances to all rounds required by the synchronous
algorithm.

Proof: By induction on the round number. Clearly each nonfaulty process advances to round 2: it
advances to round 1' after its first step and advances to round 2 after at most 1 + d/c_1 more
steps.

For r ≥ 2, assume all nonfaulty processes have advanced to round r. Then all nonfaulty
processes have sent a round r message and advanced to round r'. Since n − f ≥ 2f + 1,
there are at least 2f + 1 nonfaulty processes, so each nonfaulty process eventually receives at
least 2f + 1 round r messages and advances to round r''. After at most (2d + 3c_2)/c_1 steps
in round r'', each nonfaulty process advances to round r + 1. ■

Lemma 4.2 No nonfaulty process advances to round r + 1 before receiving a round r message
from each nonfaulty process.

Proof: By induction on the round number r.

r = 1: Each nonfaulty process takes more than d/c_1 steps before advancing to round 2.
Thus each advances later than time t_1 + d, which is after the round 1 message of each
nonfaulty process, sent at time t_1, is delivered.

r > 1: Assume the lemma is true for r − 1 (i.e., no nonfaulty process advances to round
r before receiving a round r − 1 message from each nonfaulty process). We show the lemma
is true for r. Let i be the first correct process to advance to round r + 1 and let τ_i be the
time at which i advances to round r'' (by Lemma 4.1, this time is well-defined). We make
the following series of deductions about the events that occur at or before the listed times:

τ_i :  Because i is in round r'', by the induction hypothesis, i has received
a round r − 1 message from all nonfaulty processes. Because i has
advanced to r'', it has received at least 2f + 1 round r messages.

τ_i + d :  All nonfaulty processes are in round (r − 1)' or greater (because they
have each sent an r − 1 message to i) and have received at least 2f + 1
round r − 1 messages (from each other).

τ_i + d + c_2 :  All nonfaulty processes therefore advance to round (r − 1)''.

τ_i + d + 2c_2 :  All nonfaulty processes have received at least f + 1 round r messages
(from the nonfaulty subset of processes that sent round r messages to i)
and therefore advance to round r.

τ_i + d + 3c_2 :  All nonfaulty processes send a round r message and advance to round r'.

τ_i + 2d + 3c_2 :  All processes receive a round r message from each nonfaulty process.

Because by choice i is the first nonfaulty process to advance to round r + 1, it follows
that i advances to round r + 1 only after (2d + 3c_2)/c_1 steps in round r'', which occurs later
than time τ_i + 2d + 3c_2. We conclude that i receives a round r message from each nonfaulty
process before advancing to round r + 1. Since all nonfaulty processes advance to round r + 1
after i, they also receive a round r message from all nonfaulty processes before advancing. ■

4.3  Analysis  of  time  bounds 

Again, we assume d ≫ c_2 and therefore approximate d + c_2 by d in the timing analysis.

Lemma 4.3 t_2 − t_1 ≤ Cd and, for r ≥ 2, t_{r+1} − t_r ≤ d + 2Cd.

Proof: Clearly, t_2 ≤ t_1 + Cd. By time t_r all nonfaulty processes send a round r message
and advance to round r'. Therefore by time t_r + d, all nonfaulty processes receive at least
2f + 1 round r messages and advance to round r''. Within another time 2Cd, all nonfaulty
processes have taken (2d + 3c_2)/c_1 steps and advanced to round r + 1, sending an r + 1
message. ■

Theorem 4.4 There is an algorithm using 3f + 1 processes which solves the consensus
problem for f Byzantine failures within time Cd + f(d + 2Cd) = fd + (2f + 1)Cd.

Proof: Any (f + 1)-round synchronous algorithm can be simulated. Agreement and validity
follow from correct simulation. Termination follows from Lemma 4.1. The time bound
follows from Lemma 4.3. ■
Chapter 5

Bounded-capacity Message Links and
Failure Detection


In fault-tolerant distributed algorithms, a common primitive for detecting failures is to “time
out” failed processors. If processors fail by simply stopping, then a failure may be detected
by the absence of messages from a processor. In this chapter, we consider how quickly such
failures can be detected in our semi-synchronous model.

If it is assumed that all messages sent are delivered within time d of when they are sent,
then the following simple protocol minimizes the time between any failure and its detection.
(This is the strategy employed in the algorithm of [ADLS90] and our algorithm of Chapter 3.)
Each processor broadcasts a message at every step that it takes. If no message is received
from another processor for more than (d + c_2)/c_1 local steps, that processor is declared
faulty. Because local steps are separated by at least time c_1, at least time d + c_2 passes
before this many steps are taken. Because local steps are separated by at most time c_2, the
time between the delivery of any two consecutive messages sent by a processor is at most
d + c_2. It follows that only failed processors are declared faulty. The maximum time between
any failure and its detection is approximately Cd + d, occurring in the following scenario:
processor p broadcasts a message at time t and then fails; these messages are delivered at
time t + d; every other processor runs slowly (its steps separated by c_2) after t + d, and thus
p's failure is not detected until time t + d + c_2(d + c_2)/c_1 ≈ t + d + Cd.
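The arithmetic of this broadcast-every-step protocol can be checked with a few lines of Python. This is a rough sketch, not part of the protocol: the function name `broadcast_timeout` is hypothetical, and it ignores the one extra step implied by "more than (d + c_2)/c_1 steps".

```python
from fractions import Fraction


def broadcast_timeout(d, c1, c2):
    """Timeout threshold and timing of the broadcast-every-step protocol."""
    d, c1, c2 = Fraction(d), Fraction(c1), Fraction(c2)
    timeout_steps = (d + c2) / c1            # silent local steps tolerated
    shortest_wait = c1 * timeout_steps       # fastest observer waits d + c2
    worst_detection = d + c2 * timeout_steps # slow observer, after last delivery
    return timeout_steps, shortest_wait, worst_detection
```

With d = 100, c_1 = 1 and c_2 = 2 (so C = 2), the threshold is 102 steps, the shortest wait is d + c_2 = 102, and the worst-case detection time is 304, close to Cd + d = 300.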

Although the above protocol guarantees minimal delay between any failure and its detection,
it is clearly inefficient in its use of messages. It takes advantage of the strong assumption
that  all  messages  are  delivered  within  time  d,  regardless  of  the  rate  at  which  they  are  sent. 
In  reality,  the  performance  of  a  message  link  may  suffer  if  messages  are  sent  too  frequently. 
In  this  chapter,  we  propose  a  model  of  message  links  with  bounded  capacity  and  analyze  the 
effect  of  the  capacity  bound  on  the  efficiency  of  detecting  stopping  failures. 


5.1  Modeling  bounded-capacity  links 


Define a message link of unit capacity and delay d as a communication channel that queues
incoming messages in FIFO order and delivers the first message in the queue within time
d of the later of when the message is sent and when the previous message is delivered.
(For simplicity, we will assume that message links deliver messages in the order sent. Our
algorithms do not make use of this assumption and our lower bounds hold in spite of it.) For
positive integer μ, define a message link of capacity μ and delay d as the composition of μ
message links each of unit capacity and delay d/μ, connected serially so that messages are
delivered from link i to link i + 1 for 1 ≤ i < μ and link μ delivers messages to the recipient
process. Messages are neither lost by a link nor delivered out of order, and once a processor
has sent a message, it cannot cancel that message.


Thus, in the absence of any other message traffic, the delay of a single message is bounded
by μ · d/μ = d. Note that if a single component link delays all messages by the maximum
amount, d/μ, then messages are delivered at a maximum rate of μ messages per time d. In
particular, it is easy to see that if the last component link delays each message by d/μ, then
for any interval of time of length ℓ, at most ℓμ/d + 1 messages are delivered.


On the other hand, if no two messages are sent within time d/μ of each other, then each
message is delivered within time d of when it is sent. This is easily seen by an induction on
the number of messages sent. Assume the previous message is delivered by the i-th sublink
within time i · d/μ of when it is sent (clearly this is true for the “first” message ever sent).
If message m is sent at time t, then for 0 < i ≤ μ, by time t + i · d/μ the previous message
has been delivered by link i + 1 and m is delivered by link i (by induction on i). Thus m is
delivered to the recipient process within time μ · d/μ = d of when it is sent.
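The link model can be made concrete with a short simulation. The Python sketch below is only illustrative: the function name `worst_case_deliveries` is hypothetical, and it assumes the worst case in which every component link uses its full delay d/μ for every message.

```python
from fractions import Fraction


def worst_case_deliveries(send_times, mu, d):
    """Delivery times over mu serial unit-capacity links of delay d/mu each.

    Each component link forwards a message exactly d/mu after the later of
    (a) the message's arrival at that link and (b) the link's previous
    delivery, which is the slowest behaviour the model allows.
    """
    hop = Fraction(d) / mu
    times = [Fraction(t) for t in send_times]  # arrival times at link 1
    for _ in range(mu):
        previous_out = None
        out = []
        for t in times:
            start = t if previous_out is None else max(t, previous_out)
            previous_out = start + hop
            out.append(previous_out)
        times = out  # departures from this link are arrivals at the next
    return times
```

With μ = 4 and d = 1, four messages sent d/μ = 1/4 apart each arrive exactly d after they were sent, whereas four messages sent simultaneously arrive after delays 1, 5/4, 3/2 and 7/4, so a burst pushes the later messages past the nominal delay d.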


For the lower bound, we assume only that in the worst case, a link delivers every pair of
messages at least time d/μ apart.


5.2  Timing  out  failed  processors 

We will consider a system of processors fully connected by message links of capacity μ and
delay d. These processors may fail by stopping. A processor is said to detect the failure of
another processor when it irrevocably decides that the other has failed. A timeout protocol
is correct if it satisfies two properties for all executions and all processors p and q: (1) if p
fails and q does not fail, then q eventually detects the failure of p, and (2) if neither p nor q
fails, then neither p nor q detects the failure of the other.


For a given execution α, we say that p detects the failure of q within time T in α if q
fails at time t in α and p detects the failure of q at time t' ≤ t + T in α. We say a timeout
protocol guarantees a detection time of T if for all processors p and q and all executions α
in which p fails but q does not, q detects the failure of p within time T in α.

Because  in  our  model  each  pair  of  processors  is  connected  by  a  private  bidirectional 
message  link,  we  will  assume  that  the  timeout  protocol  executes  independently  for  each 
pair  of  processors.  We  will  therefore  prove  bounds  on  detection  time  for  a  system  of  two 
processors,  p  and  q. 


5.2.1  Upper  bounds  on  detection  time 

An upper bound of 2Cd + d is achieved by a simple protocol that works for any link capacity.
The two processors continually exchange a single token message: when p receives the token
message from q, it sends a token message back to q, and q does likewise. If a processor
takes more than (2d + c_2)/c_1 steps without receiving a message, it concludes that the other
processor is faulty. Because there is at most one message in transit at any time, it is always
delivered within time d of when it is sent. Clearly a nonfaulty processor is never timed out.
This protocol guarantees that any failure is detected within time 2Cd + d (to be precise,
d + C(2d + c_2) + c_2; recall we approximate d + c_2 ≈ d): if p fails at time t, then by time t + d
all of the messages it has sent are delivered to q and q has sent its last message to p; within
another time c_2(1 + (2d + c_2)/c_1) ≈ 2Cd, q has taken enough steps to conclude that p has
failed.

An upper bound of C²d/μ + Cd + d is achieved by a protocol in which each processor
sends a message every (d/μ)/c_1 steps. A processor concludes that the other has failed if it
has taken more than (Cd/μ + d)/c_1 steps without receiving a message. Clearly, the sending
times of every two messages are separated by at least time d/μ and therefore, as shown
in Section 5.1, each message is delivered within time d of when it is sent. The maximum
amount of time between the delivery of two consecutive messages from a given processor is
then c_2(d/μ)/c_1 + d = Cd/μ + d (if the first message is delivered immediately and the following
message incurs the maximum possible delay, d). This is less than the minimum amount of
time, Cd/μ + d + c_1, that the other processor waits before detecting failure. This protocol
guarantees a detection time of C²d/μ + Cd + d: if p fails at time t, then by time t + d all of the
messages it has sent are delivered to q; within another time c_2(Cd/μ + d)/c_1 = C²d/μ + Cd,
q has taken enough steps to conclude that p has failed.

Thus we obtain a simple upper bound of min(2Cd + d, C²d/μ + Cd + d). Note that
2Cd + d ≤ C²d/μ + Cd + d if and only if μ ≤ C.
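The parameters of the two protocols can be tabulated with a short Python sketch. The function names `token_protocol` and `periodic_protocol` are hypothetical, and the returned detection times are the exact expressions from the text, before the approximation d + c_2 ≈ d.

```python
from fractions import Fraction


def token_protocol(d, c1, c2):
    """Token exchange: time out after more than (2d + c2)/c1 silent steps."""
    d, c1, c2 = Fraction(d), Fraction(c1), Fraction(c2)
    timeout_steps = (2 * d + c2) / c1
    detection = d + c2 * (1 + timeout_steps)  # = d + C(2d + c2) + c2
    return timeout_steps, detection


def periodic_protocol(d, c1, c2, mu):
    """Periodic sending: one message every (d/mu)/c1 steps."""
    d, c1, c2 = Fraction(d), Fraction(c1), Fraction(c2)
    C = c2 / c1
    send_every = (d / mu) / c1
    timeout_steps = (C * d / mu + d) / c1
    detection = d + c2 * timeout_steps        # = C^2 d/mu + C d + d
    return send_every, timeout_steps, detection
```

For d = 12, c_1 = 1, c_2 = 2 and μ = 3 (so μ > C = 2), the periodic protocol detects within 52 while the token protocol needs 66, consistent with the periodic protocol being preferable once μ exceeds C.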
5.2.2 Lower bounds on detection time


We now prove a nearly corresponding lower bound of min(2Cd + d/μ, C²d/μ + Cd + d).
Note that 2Cd + d/μ ≤ C²d/μ + Cd + d if and only if μ ≤ C + 1. Thus, the bounds are
tight except for μ < C + 1: when C ≤ μ < C + 1, C²d/μ + Cd + d is the best upper bound
and 2Cd + d/μ is the best lower bound; when μ < C, 2Cd + d is the best upper bound and
2Cd + d/μ is the best lower bound.
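The gap between the two bounds can be explored numerically. The Python sketch below simply evaluates the two min-expressions; the function names `upper_bound` and `lower_bound` are hypothetical, and exact rational arithmetic is used so that equality of the bounds can be tested directly.

```python
from fractions import Fraction


def upper_bound(C, d, mu):
    """min(2Cd + d, C^2 d/mu + Cd + d): the better of the two protocols."""
    C, d = Fraction(C), Fraction(d)
    return min(2 * C * d + d, C * C * d / mu + C * d + d)


def lower_bound(C, d, mu):
    """min(2Cd + d/mu, C^2 d/mu + Cd + d): the lower bound on detection time."""
    C, d = Fraction(C), Fraction(d)
    return min(2 * C * d + d / mu, C * C * d / mu + C * d + d)


def gap(C, d, mu):
    """Difference between the upper and lower bounds for capacity mu."""
    return upper_bound(C, d, mu) - lower_bound(C, d, mu)
```

For C = 2 and d = 12, the gap is 6 at μ = 2 and vanishes at μ = 3 = C + 1 and at μ = 4, matching the statement that the bounds coincide once μ ≥ C + 1.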

We first prove that there exists some execution in which p runs “fast” (its steps separated
by time c_1), q runs “slowly” (its steps separated by time c_2), messages from q to p are delivered
immediately, messages from p to q are delayed by at least time d, and some pair of consecutive
messages from p to q are separated by at least time d/μ. We prove that such an execution
exists for any protocol guaranteed to detect failures within any bounded amount of time.
This is proved below using the properties of the bounded-capacity message links. The idea
is that if the last component link from p to q delays all messages by d/μ then the delivery of
every pair of messages is separated by time d/μ. Therefore, if each pair of messages sent by
p were separated by less than d/μ, then messages would be sent (put onto the link) faster
than they were delivered (removed from the link). Thus the number of messages sent but
undelivered and, consequently, the total delay of a message, would grow in time without
bound. The time between when p fails and when q receives no further messages is therefore
increased without bound.

Lemma 5.1 For all B and for any correct timeout protocol that guarantees a detection time
of B, there exists an execution in which

1. All consecutive steps of p are separated by c_1;

2. All consecutive steps of q are separated by c_2;

3. All messages from q to p are delayed by time 0;

4. All messages from p to q are delayed by at least time d;

5. For all t_0, there exists a pair of messages m_1 and m_2 sent by p at times t_1 and t_2
respectively, such that t_1 ≥ t_0, t_2 − t_1 ≥ d/μ, and no message is sent by p in the
interval (t_1, t_2).

Proof: For contradiction, suppose not. Fix any execution β of the protocol in which (i) the
first three timing constraints are satisfied, (ii) each component link from p to q delays each
message by time d/μ, and (iii) no processor fails. Such an execution exists because conditions
(i), (ii) and (iii) are independent of each other and within the bounds of the model. Clearly,
condition (ii) implies that the fourth timing constraint is satisfied: all messages from p to
q are delayed at least time d. We prove that the fifth condition is also satisfied in β. To do
so, assume for contradiction that it is not.

First note that because β is infinite, p must send an infinite number of messages: if it
does not, then let m be the last message that it sends and consider an execution γ in which
p fails after sending m. Because q receives the same messages from p in each execution,
it cannot distinguish between the two executions and therefore q either does not detect p’s
failure in γ or erroneously decides that p has failed in β.

Recall that a processor can send messages only during steps and p’s steps are separated
by exactly time c_1 in β. It follows that if two consecutive messages are not separated by
at least time d/μ, then they are separated by at most k = ⌈d/(μ c_1)⌉ − 1 steps, which is time
k · c_1 < d/μ.

Consider the interval [t_0, t_0 + x] of execution β, where x is defined below. Because p
sends an infinite number of messages and, by assumption, every two consecutive messages
are separated by at most time k · c_1, process p sends at least ⌊x/(k · c_1)⌋ messages in this
interval. But since the last component link delays each message by d/μ, at most xμ/d + 1
messages are delivered in this interval. Thus the number of messages sent but not delivered
in this interval is at least (x/(k · c_1) − 1) − (xμ/d + 1). According to the properties of the message
links, the last message sent in this interval may not be delivered until all prior messages have
been delivered. Thus the last message sent by p in this interval may not be delivered until
time t_0 + x + (d/μ)(x/(k · c_1) − xμ/d − 2). Let x be any number large enough so that
(d/μ)(x/(k · c_1) − xμ/d − 2) > B (recall that k · c_1 < d/μ).

We conclude that the last message sent by p in the interval [t_0, t_0 + x] of β is not delivered
until after time t_0 + x + B. Since p does not fail in β, q does not time out p; in particular, q
does not time out p before time t_0 + x + B. However, before time t_0 + x + B, this execution is
indistinguishable to q from an execution in which p fails at time t_0 + x and which is otherwise
identical to β at p and q up to times t_0 + x and t_0 + x + B, respectively. Therefore in this
execution q does not detect the failure of p within time B. This contradicts the guarantee of the
assumed protocol. ■
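For concreteness, the choice of x in the proof above can be made explicit. The following is a sketch under the assumption that k ≥ 1, with μ denoting the link capacity and k · c_1 < d/μ as in the proof; it simply solves the required inequality for x.

```latex
\[
\frac{d}{\mu}\left(\frac{x}{k c_1} - \frac{x\mu}{d} - 2\right) > B
\iff
x\left(\frac{1}{k c_1} - \frac{\mu}{d}\right) > \frac{B\mu}{d} + 2
\iff
x > \frac{B\mu/d + 2}{\dfrac{1}{k c_1} - \dfrac{\mu}{d}} ,
\]
```

where the last step divides by a positive quantity, since k · c_1 < d/μ gives 1/(k c_1) > μ/d. Any x above this threshold therefore suffices.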

Our  lower  bound  proof  uses  the  retiming  techniques  of  “shifting”  events  in  time  and 
“shrinking”  portions  of  executions  that  were  developed  in  [AL89]  and  [LL84]. 

Theorem 5.2 In a system with links of capacity μ and delay d, no correct timeout protocol
can guarantee failures to be detected within less than time min(2Cd + d/μ, C²d/μ + Cd + d).

Proof: Let T = min(2Cd + d/μ, C²d/μ + Cd + d). For contradiction, assume the existence
of a protocol that guarantees a detection time of T. We do not make use of the particular
value of T until the final step of the proof (the construction of execution β''). We will reach
a contradiction by showing that there is an execution of the protocol in which p does not
fail but q decides that it has.

Let β be an execution of the protocol whose existence is implied by Lemma 5.1 for a
suitably large t_0, and let m_1 and m_2 be the two messages specified by the lemma, sent by p
at times t_1 and t_2 respectively. Figure 5.1 depicts an example of an execution satisfying
Lemma 5.1; for presentation, messages from p to q are shown taking exactly time d and
messages from q to p are shown at arbitrary times.

Figure 5.1: Execution β, the existence of which is proved by Lemma 5.1, takes
the above form except that messages from p to q may be delayed
more than d and messages may be sent by q at arbitrary times. The
events of p (q) are on the left (right), with time represented by the
vertical dimension. An arrow represents a message labelled with its
delay, with its tail at the time of the send event and its tip at the
time of the receive event.

Let α be an execution in which (i) events at p are identical to those of β up to time t_1,
(ii) p fails at time t_1 after sending m_1, and (iii) events at q are identical to those of β up to
time t_1 + d/μ + d. Clearly α exists, since messages from p to q are delayed by at least time
d in β and so q doesn’t receive m_2 until t_2 + d ≥ t_1 + d/μ + d. Also, the assumed protocol
guarantees that in α, q detects the failure of p before time t_1 + T.

The rest of the proof proceeds as follows. By retiming the events of α and β, we construct
executions α' and β', which are indistinguishable to both p and q from α and β respectively.
By retiming the events of α', we construct α'', which is indistinguishable from α' to q.
Finally, by retiming the events of β', we construct β'', which is indistinguishable from β'
to p up to the time that it sends m_2 and indistinguishable from α'' to q up to the time that
it times out p in α''. Thus, although p does not fail in β'', q times out p in β'',
contradicting the correctness of the assumed protocol.

Figure 5.2: In the region of interest, execution α' is simply α with events of
processor q occurring earlier in time by d. Because p fails at time
t_1, q detects the failure of p by time t_1 + T - d, denoted by the
circle.

Construction of executions α' and β'

Conceptually, we wish to construct α' from α by letting each event at q occur earlier in time
by d (“shifting” those events earlier by d). Strictly speaking, this may not be possible for
all events at q because of initial conditions. However, it is sufficient to shift by d the events
of q that occur after time t_1 in α and “shrink” some interval before t_1 in α (i.e., retiming
the events of the interval so that q runs fast in that interval of events in α'). In particular,
we shrink the interval [0, (C/(C-1))d]. Note that by our choice of t_0 = (C/(C-1))d in
choosing β, the last event of this interval occurs before m_1 is sent at time t_1. Leaving all
events at time 0 unchanged, steps of q in this interval, which take time c_2 in α, are retimed
to take time c_1 in α'. Thus the interval is shrunk by a factor of C and the last event of the
interval occurs earlier in α' by (C/(C-1))d - (1/(C-1))d = d. Figure 5.2 depicts the suffix
of α', showing the shifted events of the region in which we shall be interested.

This execution satisfies the timing constraints on message delivery, since messages sent
by p (delayed by at least d in α) are received by q at most d earlier in α' and hence are
delayed by at least 0 in α'; messages sent by q (delayed by 0 in α) are sent at most d earlier
in α' and hence are delayed by at most d in α'.
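The shift-and-shrink retiming of q's events can be illustrated numerically. The following is a minimal sketch, not part of the original proof: it assumes that the shrunk interval of α is [0, (C/(C-1))d], the length for which shrinking by a factor of C moves the last event earlier by exactly d. The function name and sample values are hypothetical.

```python
def retime_alpha_to_alpha_prime(t, C, d):
    """Map the time t of an event at q in alpha to its time in alpha'.

    Assumption (illustrative): events in [0, C*d/(C-1)] are shrunk by a
    factor of C (q runs fast there); all later events are shifted earlier
    by exactly d.
    """
    boundary = C * d / (C - 1)  # end of the shrunk interval in alpha
    if t <= boundary:
        return t / C            # shrink: steps of length c2 become c1
    return t - d                # shift: event occurs d earlier


if __name__ == "__main__":
    C, d = 3.0, 2.0
    for t in (0.0, 1.5, 3.0, 5.0):
        print(t, "->", retime_alpha_to_alpha_prime(t, C, d))
```

The two branches agree at the boundary, since (C/(C-1))d / C = (C/(C-1))d - d, so no event at q is moved earlier by more than d.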

Execution β' is constructed similarly, shifting earlier by d the events at q in β.

Because p and q do not know the time between any particular pair of steps they take,
they cannot distinguish between either α and α' or β and β'. It follows that α' and β' are
not distinguishable to p up to the point at which it fails and not distinguishable to q up
to when it receives m_2 in β' (at least time t_2 ≥ t_1 + d/μ). Also, q's detection of p's
failure occurs before time t_1 + T - d in α'.

Figure 5.3: Execution α'' is constructed from α' by mapping the interval
[t_1 - (d - d/μ), t_1 + (T - d)] of α' to the interval
[t_1 - (1/C)(d - d/μ), t_1 + (1/C)(T - d)] of α'' and appropriately
shifting the rest of q's events.


Construction of execution α''

Recall that q runs slowly in α and most of α': its steps are separated by c_2. We now
construct α'' from α' by retiming certain events at q. Events at p are the same as in α' up
to time t_1, when p fails in both executions; after time t_1, the events (message deliveries)
at p may be defined arbitrarily within the bounds of the model.

The retiming operation at q maps the interval [t_1 - (d - d/μ), t_1 + (T - d)] of α' to
the interval [t_1 - (1/C)(d - d/μ), t_1 + (1/C)(T - d)] of α'' by letting q run fast over this
interval in α''. Events at time t_1 in α' also occur at t_1 in α''; events in the above interval
of α' are retimed to occur closer to time t_1 by a factor of C. The rest of execution α'
(before time t_1 - (d - d/μ) and after time t_1 + (T - d)) is shifted merely to preserve the
step times of events on the borders of this interval. To be precise, α'' is defined at q by
retiming each event that occurs at q at time t' > (1/(C-1))d in α' to occur at q at time t''
in α'', where t'' is defined as follows:

\[
t'' =
\begin{cases}
t' + (d - d/\mu) - \frac{1}{C}(d - d/\mu) & \text{if } \frac{1}{C-1}d < t' < t_1 - (d - d/\mu) \\
t_1 + \frac{1}{C}(t' - t_1) & \text{if } t_1 - (d - d/\mu) \le t' \le t_1 + (T - d) \\
t' - (T - d) + \frac{1}{C}(T - d) & \text{if } t' > t_1 + (T - d)
\end{cases}
\]

This  execution  is  illustrated  in  Figure  5.3. 
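The piecewise retiming of q's events from α' to α'' can be sketched in code. This is an illustrative helper under stated assumptions, not part of the thesis: the function name, argument order, and sample parameters are hypothetical, and the lower end of the domain (events inside the initial shrunk interval) is not checked.

```python
def retime_q(t_prime, t1, T, C, d, mu):
    """Time in alpha'' of an event occurring at q at time t_prime in alpha'.

    Assumes t_prime lies after the initial shrunk interval (not checked).
    """
    slack = d - d / mu                 # the quantity (d - d/mu)
    lo, hi = t1 - slack, t1 + (T - d)  # interval over which q runs fast
    if t_prime < lo:
        # Shifted later so that the border event keeps its step times.
        return t_prime + slack - slack / C
    if t_prime <= hi:
        # Compressed toward t1 by a factor of C.
        return t1 + (t_prime - t1) / C
    # Shifted earlier by the amount the fast interval was shortened.
    return t_prime - (T - d) + (T - d) / C


if __name__ == "__main__":
    C, d, mu, t1 = 2.0, 4.0, 2.0, 20.0
    T = min(2 * C * d + d / mu, C * C * d / mu + C * d + d)
    for t_prime in (17.0, 18.0, 20.0, 34.0, 36.0):
        print(t_prime, "->", retime_q(t_prime, t1, T, C, d, mu))
```

The three branches agree at both borders of the fast interval, so the map is continuous and order-preserving; an event at time t_1 is left in place.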

Again, we need to shift the events before time t_1 - (d - d/μ) while preserving
initial conditions. To do this we partially undo the shrinking performed on the interval
[0, (C/(C-1))d] of α. These events were mapped to the interval [0, (1/(C-1))d] of α', in
which q runs fast, with the last event of the interval occurring exactly time d earlier in α'
than in α. In α'', we need the last event of this interval to occur exactly time
(d - d/μ) - (1/C)(d - d/μ) later than in α'. Because this amount is less than d, we are able
to do this, in effect partially undoing the original shrinking. The timing assumptions for
steps of q are clearly satisfied. Because the net effect from both shrinking operations is to
shift any particular event in the interval [0, (C/(C-1))d] of α earlier by less than d in α'',
the timing assumptions for message delivery are also clearly satisfied, for the reasons
outlined in the discussion of α'.

By construction, this retiming operation does not cause violations of the bounds on
process step times. We now verify that α'' is consistent with the timing assumptions for
message delivery. First note that all events at p before time t_1 occur at the same time in
executions β, α, α' and α''. We show that for any event at q occurring at time t'' in α'' and
at time t in α (and hence at t' = t - d in α') such that t'' ≤ t_1 and t > (C/(C-1))d, we have
t - d ≤ t'' ≤ t. By the retiming mapping above, as a function of t' we have

\[
t' \le t'' \le t' + (d - d/\mu) - \tfrac{1}{C}(d - d/\mu),
\]

(this is because t' is mapped forward in time furthest when t' ≤ t_1 - (d - d/μ); least when
t' = t_1) which, substituting t' = t - d, gives

\[
t - d \le t'' \le t - d/\mu - \tfrac{1}{C}(d - d/\mu) \le t. \tag{5.1}
\]

In α, every message from p to q is delayed by at least d. We claim that in α'', every
message from p to q is delayed by at least time 0 and by less time than in α. If a message is
delivered at q after time t_1 in α'', then because p sends no messages after time t_1, it must
be sent by t_1 (no new message receipts at q have been introduced to α'') and hence delayed at
least time 0; also, events at q after time t_1 in α'' occur earlier in α'' than in α, so the
message is delayed by less than it is in α. If a message is delivered at q at time t'' ≤ t_1 in
α'' then by Equation 5.1, it is delivered earlier in α'' than in α by not more than d; it
follows that the message is delayed by at least time 0 in α''.


We also claim that each message from q to p in α'' is delayed by at least 0
and at most d. In α, all messages from q to p are delayed 0; if in α'' they are sent before t_1,
then from Equation 5.1 they are sent earlier (and delayed more) by not more than d. The delay
of any message sent by q after t_1 is defined arbitrarily to be within the bounds of the model.
Finally, we note that q detects the failure of p before time t_1 + (1/C)(T - d) in α''.

Construction of execution β''

We now construct execution β'' in which p does not fail and which is indistinguishable to q
from α'' up to time t_1 + (1/C)(T - d). In proving that β'' satisfies the timing assumptions
on step time and message delivery, we will make use of the fact that
T = min(2Cd + d/μ, C^2 d/μ + Cd + d). Because q times out p before time t_1 + (1/C)(T - d)
in α'', we conclude that in β'', q mistakenly times out the nonfaulty p, contradicting the
assumed correctness and completing the proof.

To construct β'' at q we use exactly the same events as in α'', up to time t_1 + (1/C)(T - d).
We do not specify the events occurring at q later than this except to say that any message
sent by p after time t_1 is received at q at least time d later.


Figure 5.4: Execution β'' is essentially the same as execution α'', except that p
does not fail; instead, it runs slowly after sending message m_1, and
message m_2 is delayed by d. Because p sends no other messages
before m_2, this execution appears the same as α'' to q until it
receives m_2.


At p, we construct β'' from β' by mapping the interval [t_1, t_1 + d/μ] of β' to the interval
[t_1, t_1 + (1/C)(T - d) - d] of β'' (p runs fast over this interval in β'; it runs more slowly
over this interval in β''). Events in this interval are retimed to occur further from time t_1
by at most a factor of C (as we will show). We do not specify events occurring at p after time
t_1 + (1/C)(T - d) - d except to say that any message sent by q after time
t_1 - (1/C)(d - d/μ) is delivered at p exactly time d later. Thus, β'' is defined at p for
t'' ≤ t_1 + (1/C)(T - d) - d by retiming each event that occurs at time t' in β' to occur at
time t'' in β'', where t'' is defined as follows:

\[
t'' =
\begin{cases}
t' & \text{if } t' \le t_1 \\
t_1 + \dfrac{\frac{1}{C}(T - d) - d}{d/\mu}\,(t' - t_1) & \text{if } t_1 \le t' \le t_1 + d/\mu
\end{cases}
\]


This  execution  is  illustrated  in  Figure  5.4. 
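The stretching of p's steps can also be sketched in code. The sketch below is illustrative and not taken from the thesis; the names `stretch_factor` and `retime_p` and the sample parameters are hypothetical. It computes the factor by which the interval [t_1, t_1 + d/μ] is stretched and lets one check that, for T equal to the bound of Theorem 5.2, this factor does not exceed C.

```python
def stretch_factor(T, C, d, mu):
    """Factor by which p's events after t1 move away from t1 in beta''."""
    return ((T - d) / C - d) / (d / mu)


def retime_p(t_prime, t1, T, C, d, mu):
    """Time in beta'' of an event occurring at p at time t_prime in beta'."""
    if t_prime <= t1:
        return t_prime  # events up to t1 are unchanged
    if t_prime <= t1 + d / mu:
        return t1 + stretch_factor(T, C, d, mu) * (t_prime - t1)
    # The construction leaves later events at p unspecified.
    raise ValueError("beta'' is not specified at p beyond t1 + d/mu")


if __name__ == "__main__":
    C, d, mu, t1 = 2.0, 4.0, 2.0, 20.0
    T = min(2 * C * d + d / mu, C * C * d / mu + C * d + d)
    print("stretch factor:", stretch_factor(T, C, d, mu))
    print("end of interval maps to:", retime_p(t1 + d / mu, t1, T, C, d, mu))
```

With these sample values the end of the interval maps to t_1 + (1/C)(T - d) - d, as the mapping requires.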

We now verify that β'' is consistent with the timing assumptions of the model. Note that
all events of β'' at p before time t_1 are the same in β', β, α, α', and α''; events of β'' at q
before time t_1 + (1/C)(T - d) are by definition the same as in α''. Having already verified
the timing properties for α'', we need verify only the timing properties involving events
(processor steps, message sends, and message receipts) occurring at p in the interval
[t_1, t_1 + (1/C)(T - d) - d].

Events occurring at p later than t_1 + (1/C)(T - d) - d and at q later than
t_1 + (1/C)(T - d) are inconsequential to the proof and may be scheduled in any way consistent
with the bounds of the model.

First, we verify that successive steps of p after t_1 are separated by at most c_2. We show
that for any interval [t''_i, t''_j] of β'', mapped from the interval [t'_i, t'_j] of β', where
t_1 ≤ t'_i ≤ t'_j ≤ t_1 + d/μ, we have t''_j - t''_i ≤ C(t'_j - t'_i):

\[
t''_j - t''_i = \frac{\frac{1}{C}(T - d) - d}{d/\mu}\,(t'_j - t'_i)
\le \frac{C d/\mu}{d/\mu}\,(t'_j - t'_i) = C\,(t'_j - t'_i),
\]

where the inequality uses T ≤ C^2 d/μ + Cd + d. It follows that because any two steps of p
are separated by time c_1 in β', they are separated by at most C · c_1 = c_2 in β''.

We now verify that messages sent by p after t_1 are within the proper bounds. The first
message sent by p after t_1 is m_2, which in β' (and β) is sent at t_2 ≥ t_1 + d/μ and thus
in β'' is sent no earlier than t_1 + (1/C)(T - d) - d. Messages sent by p after time t_1 are
specified to be delayed by at least time d, so m_2 is not delivered until at least time
t_1 + (1/C)(T - d). (Note that q times out p by this time.) The delivery of m_2 and all
subsequent messages by p is consistent with our definition of β'' at q.

We now verify that messages from q to p are within the proper bounds. We analyze these
messages in three cases according to when they are sent by q in execution β' (which is the
same time as they are sent in α').

Case 1: q sends at time t' ≤ t_1 - d in β'.

A message sent by q at time t_1 - d in β' is sent and delivered at time t_1 in β. Since the
events at p are the same in β'' and β and α'', messages sent by q before t_1 - d in β'' are
delivered to p by t_1 in β'' and therefore, by the analysis of α'', are delayed by correct
amounts.

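Case 2: q sends at time t' in β' where t_1 - d < t' ≤ t_1 - (d - d/μ).

Such a message is delivered at time t' + d in β' where t_1 < t' + d ≤ t_1 + d/μ. In β'' the
sending event at q is mapped to time t' + (d - d/μ) - (1/C)(d - d/μ), which is less than t_1.
In β'' the delivery event at p is mapped to
t_1 + [((1/C)(T - d) - d)/(d/μ)](t' + d - t_1), which is greater than t_1. Thus, this message
is delayed by at least 0. We show below that it is delayed by at most d:

\[
\begin{aligned}
& t_1 + \frac{\frac{1}{C}(T-d)-d}{d/\mu}\,(t' + d - t_1)
  - \left[ t' + (d - d/\mu) - \tfrac{1}{C}(d - d/\mu) \right] \\
&\quad = t_1 + \frac{\frac{1}{C}(T-d)-d}{d/\mu}\,(d - t_1)
  + \left( \frac{\frac{1}{C}(T-d)-d}{d/\mu} - 1 \right) t'
  - \left[ (d - d/\mu) - \tfrac{1}{C}(d - d/\mu) \right] \\
&\quad \le t_1 + \frac{\frac{1}{C}(T-d)-d}{d/\mu}\,\bigl(t_1 - (d - d/\mu) + d - t_1\bigr)
  - \left[ t_1 - (d - d/\mu) + (d - d/\mu) - \tfrac{1}{C}(d - d/\mu) \right] \\
&\qquad \text{since } t' \le t_1 - (d - d/\mu) \text{ and } \frac{\frac{1}{C}(T-d)-d}{d/\mu} - 1 \ge 0 \\
&\quad = t_1 + \tfrac{1}{C}(T - d) - d - \left[ t_1 - \tfrac{1}{C}(d - d/\mu) \right] \\
&\quad \le \tfrac{1}{C}(2Cd + d/\mu - d) - d + \tfrac{1}{C}(d - d/\mu)
  \qquad \text{since } T \le 2Cd + d/\mu \\
&\quad = d.
\end{aligned}
\]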


Case 3: q sends at time t' > t_1 - (d - d/μ) in β'.

These messages are sent at t'' > t_1 - (1/C)(d - d/μ) in β'' and thus are defined to be
delivered at p exactly time d later. Note that such messages are delivered at p later than time

\[
\begin{aligned}
t_1 - \tfrac{1}{C}d + \tfrac{1}{C}(d/\mu) + d
&= t_1 + 2d + \tfrac{1}{C}(d/\mu) - \tfrac{1}{C}d - d \\
&= t_1 + \tfrac{1}{C}(2Cd + d/\mu - d) - d \\
&\ge t_1 + \tfrac{1}{C}(T - d) - d.
\end{aligned}
\]

This is consistent with our definition of β'' at p.
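The Case 2 bound can be checked numerically. The following sketch is illustrative only (the function name and sample values are hypothetical): it evaluates the delay in β'' of a message sent by q at time t' in β', using the sending and delivery times stated in Case 2, so that one can confirm the delay stays between 0 and d when T is the bound of Theorem 5.2.

```python
def case2_delay(t_prime, t1, T, C, d, mu):
    """Delay in beta'' of a q-to-p message sent at time t_prime in beta'.

    Valid for t1 - d < t_prime <= t1 - (d - d/mu), i.e. Case 2.
    """
    slack = d - d / mu
    send = t_prime + slack - slack / C           # sending time at q in beta''
    stretch = ((T - d) / C - d) / (d / mu)       # stretch of p's events
    deliver = t1 + stretch * (t_prime + d - t1)  # delivery time at p in beta''
    return deliver - send


if __name__ == "__main__":
    C, d, mu, t1 = 2.0, 4.0, 2.0, 20.0
    T = min(2 * C * d + d / mu, C * C * d / mu + C * d + d)
    for t_prime in (16.5, 17.0, 18.0):
        print(t_prime, "delay:", case2_delay(t_prime, t1, T, C, d, mu))
```

The delay grows with t' and is largest at the right end of the Case 2 range, t' = t_1 - (d - d/μ), which is why the derivation substitutes that value.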

Thus we conclude that β'' is a valid timed execution in which p does not fail but q times
out p. This contradicts the correctness of the assumed protocol. ■

5.2.3 Bounds for two processors using a single message link


We remark that our techniques give a tight upper and lower bound of C^2 d/μ + Cd + d for a
system of two processors with a message link in only one direction.

In such a system, we have two processors, p and q, and a single message link of capacity
μ from p to q. Naturally, a protocol does not need to detect failures of q. All other previous
definitions apply.

The second simple protocol described in Section 5.2.1 operates independently in each
direction. It immediately gives a protocol for the unidirectional case, guaranteeing that in
any execution, q detects the failure of p within time C^2 d/μ + Cd + d.
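To compare the two detection-time expressions concretely, the sketch below evaluates the bidirectional bound of Theorem 5.2 and the unidirectional bound C^2 d/μ + Cd + d for sample parameters. The function names and values are hypothetical and serve only as a worked example. A short calculation shows that, for C > 1, the term 2Cd + d/μ is the smaller of the two terms in the minimum exactly when μ ≤ C + 1.

```python
def bidirectional_bound(C, d, mu):
    """min(2Cd + d/mu, C^2 d/mu + Cd + d), as in Theorem 5.2."""
    return min(2 * C * d + d / mu, C * C * d / mu + C * d + d)


def unidirectional_bound(C, d, mu):
    """C^2 d/mu + Cd + d, the bound for a single link from p to q."""
    return C * C * d / mu + C * d + d


if __name__ == "__main__":
    C, d = 2.0, 4.0
    for mu in (2.0, 3.0, 8.0):
        print("mu =", mu,
              "bidirectional:", bidirectional_bound(C, d, mu),
              "unidirectional:", unidirectional_bound(C, d, mu))
```

For μ > C + 1 the two functions return the same value, so the second link gives no advantage for detection time in that range.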

It is also not difficult to see that our lower bound proof of Theorem 5.2 specializes to the
unidirectional case to give a corresponding lower bound of C^2 d/μ + Cd + d. Theorem 5.2 is
proved for T = min(2Cd + d/μ, C^2 d/μ + Cd + d). A similar theorem for the unidirectional
case may be proved with T = C^2 d/μ + Cd + d. Recall that in that proof, the value of the
timeout detection time T guaranteed by the protocol is not used before the claims about
execution β''. All preceding claims except those involving messages from q to p carry over a
fortiori. Lemma 5.1, for example, is true also for the unidirectional case with the exception
of its third condition, which regards messages from q to p. The proof of Theorem 5.2 uses
the fact that T ≤ 2Cd + d/μ in claims about β'' only to verify bounds on the delay of
messages from q to p. This analysis is not needed for a theorem about the unidirectional
case and hence the entire proof specializes to the unidirectional case to give a lower bound
of C^2 d/μ + Cd + d.

5.3 Consensus with bounded-capacity message links


We remark on how our upper bounds for consensus are affected by bounding the capacity of
the message links used.


5.3.1  Byzantine  failures 

Because it is not message-intensive, our algorithm for Byzantine failures is not affected by
the restriction of bounded-capacity message links. Recall that the algorithm for Byzantine
failures does not include a fault-detection task and does not require a process to send a
message at every step it takes. The correctness follows from the fact that at least time
2d passes between the time that the round r - 1 messages of all nonfaulty processes are
delivered and the time that any nonfaulty process advances to round r + 1. The round r
message of a nonfaulty process can incur more than delay d only if it is sent before the
previous message is delivered. The previous message is its round r - 1 message, so even if
the round r message incurs added delay, it is still delivered by time 2d (actually, d) after
the round r - 1 messages of all nonfaulty processes are delivered, and all nonfaulty processes
receive it before advancing to round r + 1. Otherwise, if the round r message does not
incur added delay due to the capacity of the message link, the proof of Lemma 4.2 holds as
before. Because nonfaulty processes do not send any messages other than the messages of
the synchronous algorithm, it is easy to see that the delay of messages does not affect the
proof of Lemma 4.3, and the running time is not affected.


5.3.2  Omission  failures 

First note that if every pair of messages sent by a process are separated by at least time
d/μ, then each message is delayed by at most time d and the omissions algorithm is not
affected (the analysis of Chapter 3 holds). Because the fault-detection protocol requires a
process to send a message at every step it takes, messages may be separated by as little as
time c_1; therefore the omissions algorithm is not affected if the message links are of
capacity μ ≥ d/c_1.

The first consequence of using links of capacity μ < d/c_1 is the obvious effect on the
fault-detection protocol. Instead of the bound Cd + d guaranteed for the time until a failure
is detected (shown in Lemma 3.1), a bound of only min(2Cd + d/μ, C^2 d/μ + Cd + d) can
be guaranteed by the fault-detection protocol. Lemmas 3.7 and 3.18 then also involve the
above expression instead of Cd + d.

But a more serious effect on the running time of the algorithm is the added delay between
when a process “should” send a message (according to the main algorithm) and when it may
send it. A crucial element of the algorithm is the piggybacking of messages of the
fault-detection task and messages of the main algorithm. A straightforward implementation
would require that messages of the main algorithm can only be sent during steps in which a
message of the fault-detection task is to be sent.

If  the  first  timeout  task,  in  which  each  pair  of  processes  continually  sends  a  “token”  back 
and  forth,  were  used  for  the  fault-detection  protocol,  up  to  time  2d  may  elapse  between 
when  a  process  is  required  to  send  a  message  of  the  main  algorithm  and  when  it  is  able  to 
piggyback  that  message  onto  a  message  of  the  fault-detection  task.  Thus  each  message  may 
in  effect  be  delayed  by  a  total  of  3d,  since  the  timeout  task  ensures  that  all  messages  are 
delivered  within  time  d  of  when  they  are  sent,  despite  the  capacity  of  the  message  links. 
This  gives  bounds  of 


4(f + 1)3d + 2Cd  for n ≥ 2f + 1
(3;^ +5)(/-|- l)3d-|-2Cd  for  n  <  2/. 

The  bound  of  Section  3.5.2  can  be  slightly  modified  to  give  a  bound  of 

(2v/2C-l-6)(/  + l)3d-|-2Cd  for  n  <  2/. 

Using the second timeout task, in which a process waits for d/(μ c_1) steps between every
pair of messages, adds a delay of up to C(d/μ) to every message. Each message is then in
effect delayed by a total of up to d(1 + C/μ). This gives bounds of

4(f + 1)(1 + C/μ)d + (C^2 d/μ + Cd)  for n ≥ 2f + 1

and 

(3^+m  +  l){l  +  C/,,)d  +  l,C^d/^  +  Cd) 
and(2C/v/?  +  2yC  +  6)(/+l)<i  +  (CV/#.  +  C<;)  "  "--J- 
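As a worked example for the case n ≥ 2f + 1, the two running-time bounds can be evaluated side by side. This sketch is illustrative; the function names and parameter values are hypothetical, and only the two formulas 4(f + 1)3d + 2Cd and 4(f + 1)(1 + C/μ)d + (C^2 d/μ + Cd) are taken from the text.

```python
def bound_first_timeout_task(f, C, d):
    """4(f+1)*3d + 2Cd: token passed back and forth between each pair."""
    return 4 * (f + 1) * 3 * d + 2 * C * d


def bound_second_timeout_task(f, C, d, mu):
    """4(f+1)(1 + C/mu)d + (C^2 d/mu + Cd): fixed spacing between messages."""
    return 4 * (f + 1) * (1 + C / mu) * d + (C * C * d / mu + C * d)


if __name__ == "__main__":
    f, C, d = 1, 2.0, 4.0
    print("first task:", bound_first_timeout_task(f, C, d))
    for mu in (2.0, 8.0):
        print("second task, mu =", mu, ":",
              bound_second_timeout_task(f, C, d, mu))
```

The first bound does not depend on μ, while the second improves as the link capacity μ grows relative to C.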

It may be possible that a more clever strategy would allow processes to send messages
of the algorithm on demand by more closely intertwining the main algorithm and the
fault-detection task.

Chapter 6
Conclusions


We  first  summarize  the  known  bounds  for  consensus: 


Failure  type 

n  > 

Lower  bound 

Upper  bound 

Reference 

Stopping 

f  +  l 

{f-l)d  +  Cd, 

{f  +  l)d 

2{f+l)d  +  Cd 

[ADLS90] 

Omissions  (sending) 

2f  +  l 

T 

4(/  l)d  Cd 

Thm.  3.21 

/  +  1 

T 

+  5)(/  +  l)d  +  Cd, 
{2^  +  &){f  +  l)d  +  Cd 

Thm.  3.25 
Thm.  3.28 

“Timing” 

/  +  1 

T 

{f  +  l){Cd  +  d) 

(see  below) 

Byzantine 

3/ +  1 

T 

{f  +  l){d  +  2Cd) 

Thm.  4.4 

Auth.  Byzantine 

2f  +  l 

T 

{f+l){d  +  2Cd) 

(see  below) 

(1 

n-f  ^  .n-/. 

)Cd 

(C/+1  +  +  . . .  +  C)d 

(see  below) 

The bounds for stopping and omission failures (for n ≥ 2f + 1) are tight to within
approximately a constant factor (2 and 4, respectively). The bounds for omission failures when
n ≤ 2f are not tight; an improvement in either direction would be interesting. It has been
noted by Bharali ([B91]) that the running time for omission failures can be improved to
3(f + 1)d + Cd by the following modification to the algorithm. The improvement is obtained
by reducing the delay caused by a process that must wait for acknowledgments before
sending an r message. Recall that if p_i sends an r - 1 message at exactly t_{r-1} (the latest
possible time) and immediately thereafter receives an r message, it may have to wait until
time t_{r-1} + 2d to receive enough acknowledgments for its r - 1 message before sending an
r message and advancing to phase r + 1. Thus a process p_j receiving the r message from p_i
would not receive it until t_{r-1} + 3d. The idea is to let p_i send a “virtual r message”, even
though it has not yet received n - f acknowledgments for its r - 1 message. Process p_j
does not treat a virtual r message from p_i as a regular r message until it sees that p_i has
received enough acknowledgments for its r - 1 message (recall that all messages, including
acknowledgments, are broadcast to all processes). Thus, if p_i does get enough
acknowledgments, then both p_i and p_j receive them by time t_{r-1} + 2d and p_j has effectively
received a (real) r message from p_i by time t_{r-1} + 2d, saving time d.

°This is an updated version of the original Chapter 6. It differs by the inclusion of the following: the
upper bound for authenticated Byzantine failures when n ≥ 2f + 1, the improvement of the constant from
4 to 3 for the running time of the omissions algorithm ([B91]), and the more careful analysis of the running
time in the model of [HK89].

For failures less benign than omissions, this thesis leaves open a large gap in time
complexity. In particular, the following central question remains unanswered:

Does consensus in the presence of Byzantine failures require time Ω(fCd)?

The difficulty of this problem seems to lie not in the potential for arbitrary message
content but in the potential for timing misbehavior. We believe an important step towards
answering this question will be to obtain tight bounds for “timing failures”, described below.

Timing  failures 

We say that a process suffers a “timing failure” if the time between some pair of successive
steps is not in the interval [c_1, c_2]. The simple direct rounds simulation first described in
Chapter 3 tolerates timing failures as well, implying a consensus algorithm with running
time (f + 1)(Cd + d). The algorithm of [ADLS90] is also correct despite timing failures, but
each of its phases may take up to time Cd + d. In fact, no algorithm is known to tolerate
timing failures in less than time O(fCd).

Byzantine  failures  with  authentication 

First note that the direct simulations outlined at the beginning of Chapter 3 do not work in
the presence of Byzantine failures, even with authentication, and our general simulation of
Chapter 4 itself requires n ≥ 3f + 1.

The upper bound for authenticated Byzantine failures with n ≥ 2f + 1 can be obtained by
a  very  simple  modification  of  the  algorithm  for  Byzantine  failures:  change  “If  |V'’’|  >  2f  +  1” 
to  “relay  1/’’+'”  (unconditionally).  In  other  words,  to  ensure  that  every  process  receives  f  +  l 
round r messages (and therefore sends its own round r message), it is sufficient to relay the
f + 1 round r messages already received; these messages are signed and therefore believable.
This protocol works for n ≥ 2f + 1 and achieves the same time complexity.

When n ≤ 2f, the only obvious algorithm to tolerate authenticated Byzantine failures
is a costly simulation of a synchronous algorithm. The simulation requires that processes
begin synchronized and time out each other's timeouts by terminating round i after
(C^{i-1} + ... + C + 1)d/c_1 steps.

The lower bound for authenticated Byzantine failures, not presented in this thesis, is
interesting (greater than Cd) only for the limited range of n ≤ 2f, and therefore says
nothing interesting about unauthenticated Byzantine failures. The proof of this bound is
similar to the “shifting scenarios” proofs of [FLM86].

Before suggesting other directions for further research, we first comment on the
implications of our bounds for consensus in a closely related model.


6.1  Consensus  in  the  related  model  of  [HK89] 

Herzberg and Kutten [HK89] consider a model in which the actual worst-case message delay
in a given execution, δ, may be much less than the a priori worst-case bound on message
delay, Δ. It is thus desirable for the running time of algorithms to depend minimally on Δ.
This model raises concerns similar to those of our model; in particular, detecting the absence
of a message may be much more expensive than receiving the message.

For the consensus problem, it is not difficult to see that direct implementation of
synchronous algorithms gives a running time of O(fΔ) for any type of failures; on the other
hand, clearly the synchronous lower bound implies that no algorithm can guarantee a running
time of less than (f + 1)δ. In this model, our algorithms yield an improvement over direct
simulation strategies similar to the corresponding improvement in our semi-synchronous
model.

It is not difficult to see that our algorithms may be run without modification in the model
of [HK89], yielding the same running times with the syntactic substitution of Δ for Cd and
δ for d. Thus we obtain bounds of

4(f + 1)δ + Δ  for n ≥ 2f + 1
(3;:^  +  5)(/  -h  1  )6  -t-  A  for  n  <  2/. 
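A small numeric comparison shows the size of the improvement when δ is much smaller than Δ. This is only a sketch: the function names and sample values are hypothetical, and the figure used for direct simulation, (f + 1)Δ, is an assumed representative of the O(fΔ) running time rather than a constant given in the text.

```python
def bound_our_algorithm(f, delta, Delta):
    """4(f+1)*delta + Delta, the bound for n >= 2f + 1 in the [HK89] model."""
    return 4 * (f + 1) * delta + Delta


def bound_direct_simulation(f, Delta):
    """(f+1)*Delta: assumed cost of f+1 rounds, each timed out after Delta."""
    return (f + 1) * Delta


if __name__ == "__main__":
    f, delta, Delta = 2, 1.0, 100.0
    print("our algorithm:", bound_our_algorithm(f, delta, Delta))
    print("direct simulation (assumed):", bound_direct_simulation(f, Delta))
```

With these sample values the first bound is dominated by the single additive Δ term, while the assumed direct-simulation cost pays Δ once per round.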

The bound involving √C carries over with √(Δ/δ) instead of √C, giving a bound of
(2√(Δ/δ) + 6)(f + 1)δ + Δ. If δ ≪ Δ, these are significant improvements over the bounds
obtainable by direct simulation. Moreover, it is possible to prove better bounds for these
algorithms in general, depending on the ratio of Δ to δ; the bound of 4(f + 1)d + Cd is
realized by an execution only if Δ = 4δ. This is because process clocks are perfectly
synchronized: whereas in our model the time between failure and detection may be anywhere in
the range [d, d + Cd], in this model it must be Δ (plus or minus twice the step time, which is
assumed to be much less than δ; see [ADLS90], §7). The length of each phase except the last
must therefore be Δ and must have at least Δ/δ - 3 failures (processes in B_r). There are thus
at most (f + 1)/(Δ/δ - 3) phases except for the last phase, and the running time is at most
(Δ/(Δ - 3δ))(f + 1)δ + Δ. The running time is the maximum of this expression and
4(f + 1)δ + Δ.
Similarly  for  the  stopping  failures  algorithm  of  [ADLS90],  the  running  time  is  the  maximum 
of  2{f  +  l)d)  +  Cd  and  (^^)(/  +  1)<5  +  A.  Our  algorithm  for  Byzantine  failures  is  not 
interesting in this model, as it is trivial to design an algorithm taking only time (f + 1)Δ
(our algorithm takes (2f + 1)Δ + fδ).

In  comparison,  the  algorithms  of  [DLS88]  may  also  be  used  in  the  model  of  [HK89].  For 
stopping  and  omission  failures  (sending  and  receiving),  their  algorithms  require  n  >  2/  +  1; 
for  Byzantine  failures,  they  require  n  >  3/  +  1.  Their  algorithms  assume  only  that  an  upper 
bound  on  message  delay  time  exists — it  may  not  be  known  to  the  processes;  the  running  time 
is  a  function  of  the  maximum  message  delay  in  the  given  execution.  The  running  times  are 
0{S'^ + n^) for all types of failures. (As noted in [DLS88], the running times can be improved to
0(S^ + /^). We also note that the different model considered there, which enables processes
to  send  to  at  most  one  process  per  step,  does  not  affect  the  time  bound  asymptotically.) 

Note that in contrast to our algorithm and the algorithm of [ADLS90], the running times
of the algorithms of [DLS88] in the [HK89] model do not depend at all on Δ, the upper bound
on message delay time. This is possible because the model of [HK89] provides an extra degree
of  power  to  the  algorithm  by  assuming  that  process  clocks  are  perfectly  synchronized.  The 
algorithms  of  [DLS88]  do  not  give  good  bounds  in  our  model;  the  running  times  depend  only 
polynomially on the ratio of process step rates, C. This difference in the model also accounts
for the simplicity of solving consensus in the presence of Byzantine failures in the [HK89]
model,  relative  to  our  semi-synchronous  model. 


6.2  Directions  for  further  research 

There  are  many  possible  directions  for  interesting  research  addressing  the  issues  and  concerns 
of real-time behavior of distributed systems:

•  The  existence  of  the  underlying  synchronous  algorithm  described  in  Section  3.2  suggests 
that  the  results  of  [ADLS90]  and  this  thesis  may  be  generalizable  to  certain  classes  of 
synchronous  algorithms.  For  instance,  the  properties  of  the  underlying  synchronous 
algorithm  that  make  it  amenable  to  “efficient”  simulation  in  our  model  are  that  it  is 
“early-stopping”  and  that  processes  advance  to  further  rounds  only  because  of  messages 
received  (not  because  of  messages  omitted). 

—  Can  the  properties  sufficient  for  efficient  simulation  be  clearly  characterized? 

—  Can  these  properties  be  shown  necessary  by  proving  lower  bounds  with  large 
dependency  on  C  for  synchronous  algorithms  lacking  these  properties? 

—  Are the factors of and √C inherent to such simulations with n ≤ 2f and
tolerating omission failures?

•  What  classes  of  problems  are  in  fact  affected  by  timing  uncertainty?  Perhaps  problems 
solvable  in  asynchronous  systems  need  not  be  affected.  Can  they  be  helped  by  timing 
assumptions? Are only fault-tolerant problems affected?

•  Similar questions can be asked in the context of the model of Herzberg and Kutten
([HK89]): What can be said about converting synchronous algorithms with running
times as a function of message delay d to algorithms that depend on the actual
worst-case message delay δ rather than the a priori worst-case message delay Δ?

•  What  can  be  said  about  simulating  synchronous  algorithms  that  do  not  operate  in 
rounds? 

•  Other  work  ([SDC90])  on  the  real-time  complexity  of  the  consensus  problem  assumes 
a  different  model  of  semi-synchrony.  There,  continuous  local  clocks  are  assumed  to 
be within a fixed constant ε of each other and to stay within a linear envelope of real
time.  Insight  into  how  these  two  models  are  related  would  enable  a  comparison  of  the 
bounds  that  have  been  obtained.  In  particular,  using  the  assumptions  of  our  model, 
for what values of their parameters is their model implementable?

•  We  have  given  a  straightforward  implementation  of  our  consensus  algorithm  using 
bounded capacity message links. Can a more involved approach do better than simply
increasing the effective delay of each message?

•  For other problems, can bounded capacity message links be used to implicitly control
message complexity by causing message inefficiency to be manifested as time
inefficiency?